{"title": "Thinning for Accelerating the Learning of Point Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 4091, "page_last": 4101, "abstract": "This paper discusses one of the most fundamental issues about point processes that what is the best sampling method for point processes. We propose \\textit{thinning} as a downsampling method for accelerating the learning of point processes. We find that the thinning operation preserves the structure of intensity, and is able to estimate parameters with less time and without much loss of accuracy. Theoretical results including intensity, parameter and gradient estimation on a thinned history are presented for point processes with decouplable intensities. A stochastic optimization algorithm based on the thinned gradient is proposed. Experimental results on synthetic and real-world datasets validate the effectiveness of thinning in the tasks of parameter and gradient estimation, as well as stochastic optimization.", "full_text": "Thinning for Accelerating the Learning of\n\nPoint Processes\n\nTianbo Li, Yiping Ke\n\nSchool of Computer Science and Engineering\nNanyang Technological University, Singapore\n\ntianbo001@e.ntu.edu.sg, ypke@ntu.edu.sg\n\nAbstract\n\nThis paper discusses one of the most fundamental issues about point processes\nthat what is the best sampling method for point processes. We propose thinning\nas a downsampling method for accelerating the learning of point processes. We\n\ufb01nd that the thinning operation preserves the structure of intensity, and is able to\nestimate parameters with less time and without much loss of accuracy. Theoretical\nresults including intensity, parameter and gradient estimation on a thinned history\nare presented for point processes with decouplable intensities. A stochastic opti-\nmization algorithm based on the thinned gradient is proposed. 
Experimental results on synthetic and real-world datasets validate the effectiveness of thinning in the tasks of parameter and gradient estimation, as well as stochastic optimization.\n\n1 Introduction\n\nPoint processes are a powerful statistical tool for modeling event sequences and have drawn massive attention from the machine learning community. Point processes have been widely used in finance [6], neuroscience [8], seismology [28], social network analysis [22] and many other disciplines. Despite their popularity, applications related to point processes are often plagued by the scalability issue. Some state-of-the-art models [37, 35, 33] have a time complexity of O(d^2n^3), where n is the number of events and d is the dimension. As the number of events increases, learning such a model would be very time consuming, if not infeasible. This becomes a major obstacle in applying point processes.\nA simple strategy to address this problem is to use part of the dataset in the learning. For instance, in mini-batch gradient descent, the gradient is computed at each iteration using a small batch instead of the full data. For point processes, however, finding a suitable sampling method is not an easy task, at least not as easy as it might seem. This is due to the special input data \u2014 event sequences. First of all, event sequences are posets. An inappropriate sampling method may spoil the order structure of the temporal information. This is especially harmful when the intensity function depends on its history. Second, many models built upon point processes utilize the arrival intervals between two events. Such models are particularly useful as they take into account the interactions between events or nodes. Examples include Hawkes processes and their variants [37, 33, 19]. 
An improper sampling method may change the lengths of arrival intervals, leading to a poor estimation of model parameters.\nA commonly-used approach to the sampling of point processes is sub-interval sampling [34, 32]. Sub-interval sampling is a piecewise sampling method, which splits an event sequence into small pieces and learns the model on these sub-intervals. At each iteration, one or several sub-intervals are selected to compute the gradient. This method, however, has an intrinsic limitation: it cannot capture the panoramic view of a point process. Take self-excited event sequences as an example. An important characteristic of such sequences is that events are not evenly distributed across the time axis, but tend to clump within a short period of time. Sub-interval sampling, in this circumstance, is like \u201ca blind man appraising an elephant\u201d \u2014 it can only see part of the information at each iteration, and is prone to a large variance of the gradient.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nFigure 1: Comparison of sub-interval sampling (a) and thinning sampling (b).\n\nIn this paper, we discuss \u201cthinning sampling\u201d as a downsampling method for accelerating the learning of point processes. A comparison between sub-interval sampling and thinning sampling is shown in Figure 1. Conventionally, thinning is a classic technique for simulating point processes [18]. We borrow the idea and adopt it as a downsampling method for fast learning of point processes. It is convenient to implement and able to capture the entire landscape over the observation timeline.\nThe main contributions of this paper are summarized as follows.\n\u2022 To the best of our knowledge, we are the first to employ thinning as a downsampling method to accelerate the learning of point processes.\n\u2022 We present theoretical results for intensity, parameter and gradient estimation on the thinned history of a point process with decouplable intensities. We also apply thinning in stochastic optimization for the learning of point processes.\n\u2022 Experiments verify that thinning sampling can significantly reduce the model learning time without much loss of accuracy, and achieve the best performance when training a Hawkes process on both synthetic and real-world datasets.\n\n2 Point Processes\n\nA point process N(t) can be viewed as a random measure from a probability space (\u2126, F, P) onto a simple point measure space (N, BN). We define a point process N(t) as follows.\nDefinition 2.1 (Point process). Let ti be the i-th arrival time of a point process N(t), defined by ti = inf{t : N(t) \u2265 i}. A point process N(t) on R+ is defined by N(t) = \u2211_i \u03b4_{ti}(t), where \u03b4_\u03c9 is the Dirac measure at \u03c9.\nThe \u201cinformation\u201d available at time t is represented by a sub-\u03c3-algebra Ht = \u03c3(N(s) : s \u2264 t). The filtration H = (Ht)_{0\u2264t<\u221e} is called the internal history. A point process can be characterized by its intensity function. It measures the probability that a point will arrive in an infinitesimal period of time given the history up to the current time. Herein we follow the definition of stochastic intensity introduced in [7, 15].\nDefinition 2.2 (Stochastic intensity). Let N(t) be an H-adapted point process and \u03bb(t) be a nonnegative H-predictable process. \u03bb(t) is called the H-intensity of N(t) if, for any nonnegative H-predictable process C(t), the equality below holds:\n\nE[\u222b_0^\u221e C(s) dN(s)] = E[\u222b_0^\u221e C(s)\u03bb(s) ds].   (1)\n\nThe expectation of N(t) is called the H-compensator, which is the cumulative intensity \u039b(t) = \u222b_0^t \u03bb(s) ds. Doob-Meyer decomposition yields that N(t) \u2212 \u039b(t) is an H-adapted martingale. 
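As a quick numerical illustration of the compensator (our sketch, not part of the paper; the constant rate, horizon, and number of paths are arbitrary choices), consider a homogeneous Poisson process with \u03bb(s) \u2261 \u03bb: averaging N(T) \u2212 \u039b(T) over simulated paths should give a value close to zero, in line with the martingale property.

```python
import numpy as np

rng = np.random.default_rng(0)
lam, T, n_paths = 2.0, 10.0, 5000  # constant intensity, horizon, number of paths (arbitrary)

def simulate_poisson(lam, T, rng):
    """Arrival times t_i of a homogeneous Poisson process on [0, T]."""
    times, t = [], 0.0
    while True:
        t += rng.exponential(1.0 / lam)  # i.i.d. exponential inter-arrival times
        if t > T:
            return np.array(times)
        times.append(t)

# Martingale property: N(t) - Lambda(t) has mean zero; check at t = T,
# where the compensator is Lambda(T) = lam * T.
diffs = [len(simulate_poisson(lam, T, rng)) - lam * T for _ in range(n_paths)]
print(np.mean(diffs))  # close to 0
```

By Definition 2.2 the same check works for any nonnegative predictable C(t); taking C \u2261 1 recovers E[N(T)] = \u039b(T).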
Another important result is that the conditional intensity function uniquely determines the probability structure of a point process [9]. Similar results can be extended to compensators [10, 15].\nM-estimator. Commonly-used parameter estimation methods for point processes include maximum likelihood estimation (MLE) [22, 33, 37], intensity-based least square estimation (LSE) [3, 21] and counting-based LSE [32]. These methods fall into a wider class called martingale estimators (M-estimators) [15, 2]. The gradient \u2207R(\u03b8), \u03b8 \u2208 R^d, can be expressed as a stochastic integral with respect to N(t):\n\n(full gradient)   \u2207R(\u03b8) = \u222b_0^T H(t; \u03b8)[dN(t) \u2212 \u03bb(t; \u03b8)dt].   (2)\n\nHere R(\u03b8) is the loss function, which may be the log-likelihood, or the sum of the squares of the residuals. \u03bb(t; \u03b8) is the intensity function to be estimated. [0, T] is the observation window. H(t; \u03b8) is a vector-valued function, or more generally, a predictable, boundedly finite, and square integrable process associated with \u03bb(t; \u03b8). Different choices of H(t; \u03b8) instantiate different estimators: H(t; \u03b8) = \u2212\u2207 log \u03bb(t; \u03b8) for MLE, H(t; \u03b8) = \u2207\u03bb(t; \u03b8) for intensity-based LSE, and H(t; \u03b8) = 1 for counting-based LSE. We write \u2211_i \u2207R(\u03b8; \u03c9i), \u03c9i \u2208 \u2126, as the empirical gradient given realizations {\u03c9i}. Under the true parameter \u03b8\u2217, \u2207R(\u03b8\u2217) is a martingale and its expectation is 0. Gradient descent methods are often used to find \u02c6\u03b8 by solving \u2211_i \u2207R(\u02c6\u03b8; \u03c9i) = 0. Note that in this paper, LSE refers to intensity-based LSE, as all the results can be easily extended to counting-based LSE.\n\n3 Thinned Point Processes\n\nIn this section, we introduce a derivative process called the p-thinned process. Intuitively, the p-thinned process is obtained from a point process by retaining each point in it with probability p, and dropping it with probability 1 \u2212 p. We formally define the p-thinned process as follows.\nDefinition 3.1 (p-thinned process). The p-thinned process Np(t) associated with a point process N(t) (called the ground process) is defined by summing up the Dirac measure on the product space \u2126 \u00d7 K:\n\nNp(t) = \u2211_{i=1}^{N(t)} \u03b4_{(ti, Bi=1)}(t),\n\nwhere the Bi\u2019s are independent Bernoulli distributed random variables with parameter p.\nAlternatively, the p-thinned process can be written in the form of a compound process as Np(t) = \u2211_{i=1}^{N(t)} Bi, and its differential can be expressed as dNp(t) = B_{N(t)}dN(t). In this way, the Riemann\u2013Stieltjes integral of a stochastic process with respect to a p-thinned process can be defined by\n\n\u222b_0^T H(t)dNp(t) = \u222b_0^T H(t)B_{N(t)}dN(t) = \u2211_{i=1}^{N(T)} H(ti|H_{ti\u2212})Bi.\n\nTwo types of histories. The differential defined above implies that the intensity of the thinned process can be written as \u03bbp(t) = p\u03bb(t). This relation between the intensities of a thinned process and its ground process is intuitively plausible. The implicit condition, however, is that \u03bbp(t) must be measurable with respect to the full history of N(t) and all the thinning marks Bi prior to the current time t. Such a history can be expressed by the filtration F = (Ft), where Ft = Ht \u2297 Kt and Kt is the cylinder \u03c3-algebra of the markers. 
This history is called the full history, and its corresponding F-intensity is denoted by \u03bbF_p(t). When computing \u03bbF_p(t), we need to take into account all the points prior to t, including the dropped ones.\nThe other type of history, called the thinned history, is the internal history of the thinned process, denoted by G = (Gt), where Gt = \u03c3(Np(t)). In computing its G-intensity \u03bbG_p(t), we only need to consider the retained points of the thinned process.\nThe following lemma describes the relationship between the different intensities.\nLemma 3.2 (Relationship of intensities). Let F and G be the full history and the thinned history with respect to a p-thinned process Np(t). Let H be the internal history of N(t). The following equalities hold:\n\n(1) \u03bbF_p(t) = p\u03bbH(t);\n(2) \u03bbG_p(t) = pE[\u03bbH(t)|G].\n\nDue to the space limit, we defer all the proofs to the supplementary material. Lemma 3.2 tells that the intensity of a p-thinned process is a version of the conditional expectation E[\u03bbH(t)|G]. Provided the information of the p-thinned process, the p-thinned intensity is an orthogonal projection of that of the ground point process on L2. Therefore, 1/p \u03bbG_p(t) is an unbiased estimation of \u03bbH(t).\np-thinned and sub-interval gradients. We define the p-thinned gradient, which is the stochastic integral with respect to a p-thinned process:\n\n(p-thinned gradient)   \u2207Rp(\u03b8) = 1/p \u222b_0^T HG_p(t; \u03b8)[dNp(t) \u2212 \u03bbG_p(t; \u03b8)dt].   (3)\n\nHere HG_p(t; \u03b8) is related to \u03bbG_p(t; \u03b8).\nThe sub-interval gradient is defined as follows. Let \u03c40, \u03c41, \u03c42, . . . , \u03c4\u23081/p\u2309 be a partition of [0, T), where \u03c40 = 0 and \u03c4\u23081/p\u2309 = T. 
We cut the interval into \u23081/p\u2309 pieces so that the batch size is comparable to that in the thinned gradient. At every step, one interval is selected with probability p. We define the sub-interval gradient on [\u03c4i, \u03c4i+1) by\n\n(sub-interval gradient)   \u2207R\u2113(\u03b8) = 1/p \u222b_0^T I{t \u2208 [\u03c4i, \u03c4i+1)} HF(t; \u03b8)[dN(t) \u2212 \u03bbF(t; \u03b8)dt],\n\nwhere I is an indicator representing whether the sub-interval is selected or not. Here we consider the full history. It can be easily verified that \u2207R\u2113(\u03b8) is an unbiased estimation of the full gradient in Eq. (2). The thinned gradient can be used as an estimator of the full gradient, which will be illustrated in Section 5. This definition also generalizes the stochastic optimization method proposed in [32], which splits the observation timeline at the arrival of each event.\n\n4 Thinning for Parameter Estimation\n\nIn this section, we discuss how to estimate the parameter \u03b8 \u2208 R^d of the intensity function \u03bb(t; \u03b8) given the thinned history G. We first define the notations used.\n\u2022 \u03b8\u2217H: true parameter of the H-intensity \u03bbH(t; \u03b8), such that E\u2207R(\u03b8\u2217H) = 0;\n\u2022 \u03b8\u2217G: true parameter of the G-intensity \u03bbG_p(t; \u03b8), such that E\u2207Rp(\u03b8\u2217G) = 0;\n\u2022 \u02c6\u03b8H: estimate of \u03b8\u2217H, such that \u2211_i \u2207R(\u02c6\u03b8H; \u03c9i) = 0;\n\u2022 \u02c6\u03b8G: estimate of \u03b8\u2217G, such that \u2211_i \u2207Rp(\u02c6\u03b8G; \u03c9\u2032i) = 0, where \u03c9\u2032i is a realization of the p-thinned process.\nThe task of parameter estimation on a thinned history is to find \u02dc\u03b8H such that E[\u2207R(\u02dc\u03b8H)|G] is close enough to 0. We refer to \u02dc\u03b8H as the M-estimator on thinned history. 
Here the expectation is over the thinning operation. The tilde is used to indicate that \u02dc\u03b8H is a G-measurable estimator for the parameter of the H-intensity \u03bbH(t; \u03b8), whereas \u02c6\u03b8H, with a hat on it, is H-measurable. A notable result is that M-estimators have asymptotic normality [2]; thus we have \u02c6\u03b8H P\u2212\u2192 \u03b8\u2217H and \u02c6\u03b8G P\u2212\u2192 \u03b8\u2217G, as the number of realizations n \u2192 \u221e.\nIn the following, we first present a method for parameter estimation of a non-homogeneous Poisson process (NHPP), whose intensity is deterministic. We then derive a theorem that works for a more general type of intensities.\nLemma 4.1 (Thinning for parameter estimation of NHPP). Consider an NHPP N(t) with deterministic intensity \u03bb(t; \u03b8), t > 0, \u03b8 \u2208 R^d. If there exists an invertible linear operator A : R^d \u2192 R^d satisfying \u03bb(t; A\u03b8) = p\u03bb(t; \u03b8), then the M-estimator on thinned history can be written as \u02dc\u03b8H = A\u22121\u02c6\u03b8G, such that E[\u2207R(\u02dc\u03b8H)|G] P\u2212\u2192 0, as the number of realizations n \u2192 \u221e.\nExample (Parameter estimation for NHPP). Let us consider an NHPP with intensity \u03bb(t; a, b, c, d) = a + b sin(ct + d). We can find a diagonal matrix A = diag(p, p, 1, 1) such that \u03bb(t; A(a, b, c, d)) = pa + pb sin(ct + d) = p\u03bb(t; a, b, c, d). Thus the parameter given the thinned history can be estimated by A\u22121(\u02c6a, \u02c6b, \u02c6c, \u02c6d) = (1/p \u02c6a, 1/p \u02c6b, \u02c6c, \u02c6d), where \u02c6a, \u02c6b, \u02c6c, \u02c6d are estimated on the thinned history.\nNext, we focus on a more general type of intensities, called decouplable intensities. Most commonly-used point processes have decouplable intensities, including NHPPs, linear Hawkes processes, compound Poisson processes, etc.\nDefinition 4.2 (Decouplable intensity). 
An intensity function is said to be decouplable if it can be written in the form\n\n\u03bbH(t; \u03b8) = g(t; \u03b8)T mH(t).   (4)\n\nHere g(t; \u03b8) is a deterministic vector-valued function that is continuous with respect to \u03b8 and does not contain any information regarding Ht. mH(t) is an H-predictable vector-valued measure that does not contain any information regarding \u03b8. Particularly, \u03bbH(t; \u03b8) is said to be linear if g(t; \u03b8) = \u03b8.\nThis category covers a multitude of state-of-the-art models, including Netcodec [30], parametric Hawkes [19], the MMEL model [35], Granger causality for Hawkes [33], and the sparse low-rank Hawkes [36]. The next theorem demonstrates a result similar to Lemma 4.1 for decouplable intensities.\nTheorem 4.3 (Thinning for parameter estimation of decouplable intensities). Consider a point process N(t) with decouplable intensity. If there exist invertible linear operators A and B satisfying BE[mH(t)|G] = mG_p(t), where mG_p(t) is the component of the thinned intensity \u03bbG_p(t), and pB\u22121g(t; \u03b8) = g(t; A\u03b8), then the M-estimator on thinned history can be written as \u02dc\u03b8H = A\u22121\u02c6\u03b8G, such that E[\u2207R(\u02dc\u03b8H)|G] P\u2212\u2192 0, as the number of realizations n \u2192 \u221e. Particularly, if \u03bbH(t; \u03b8) is linear, then A = pB\u22121.\nExample (Parameter estimation for Hawkes processes). Consider a one-dimensional Hawkes process with intensity \u03bbH(t; \u00b5, \u03b1) = (\u00b5, \u03b1)T (1, mH(t)), where mH(t) = \u2211_{ti<t} \u03d5(t \u2212 ti). From the fact that E[mH(t)|G] = 1/p mG_p(t), we obtain B = diag(1, p). Thus Theorem 4.3 yields A = pB\u22121 = diag(p, 1), and consequently \u00b5 and \u03b1 can be estimated by 1/p \u02c6\u00b5 and \u02c6\u03b1, where \u02c6\u00b5 and \u02c6\u03b1 are estimated on the thinned history. Similar results can be obtained on multi-dimensional linear Hawkes processes. 
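The Hawkes example above can be checked numerically. The following sketch (ours, with an arbitrary fixed event sequence and an assumed exponential kernel \u03d5(s) = e^{\u2212s}) verifies the relation E[mH(t)|G] = 1/p mG_p(t) behind B = diag(1, p): averaged over repeated p-thinnings, the kernel sum over retained points, rescaled by 1/p, recovers the kernel sum over all points.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 0.5                                  # thinning level
events = rng.uniform(0.0, 8.0, size=40)  # a fixed event sequence t_i (arbitrary)
t = 10.0                                 # evaluation time, t > max(t_i)
phi = lambda s: np.exp(-s)               # assumed exponential triggering kernel

m_full = phi(t - events).sum()           # m^H(t) = sum over all events of phi(t - t_i)

# Draw many independent p-thinnings (Bernoulli marks B_i) and average the
# retained kernel sum m^G_p(t) = sum over kept events of phi(t - t_i).
n_rep = 100000
keep = rng.random((n_rep, events.size)) < p
m_thin = (phi(t - events) * keep).sum(axis=1)

# E[m^G_p(t)] = p * m^H(t), so rescaling by 1/p recovers the full kernel sum.
print(m_thin.mean() / p, m_full)
```

Since the first component of (1, mH(t)) is a constant, it is left untouched by thinning, which is exactly why B only rescales the second component.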
This result reveals that the thinning operation does not change the endogenous triggering pattern in linear Hawkes processes.\nRemark (Parameter estimation for multi-dimensional Hawkes processes). The thinning estimator is also valid for multi-dimensional Hawkes processes. Consider the i-th dimension of an m-dimensional Hawkes process. Its intensity function can be written as \u03bbH_i(t; \u00b5i, \u03b1i1, \u00b7\u00b7\u00b7, \u03b1im) = (\u00b5i, \u03b1i1, \u00b7\u00b7\u00b7, \u03b1im)T (1, mH_1(t), \u00b7\u00b7\u00b7, mH_m(t)), which complies with the definition of decouplable intensity. Theorem 4.3 again yields a thinning estimator with the linear operator A = diag(p, 1, ..., 1).\n\n5 Thinning for Gradient Estimation and Stochastic Optimization\n\nSo far we have discussed how to estimate the parameter given the thinned history. In fact, the gradient at any \u03b8 can also be recovered without knowing all the information about a point process. The following theorem describes gradient estimation on the thinned history for decouplable intensities.\nTheorem 5.1 (Thinning for gradient estimation). Let N(t) be a point process with decouplable intensity \u03bbH(t; \u03b8) = g(t; \u03b8)T mH(t) as in Eq. (4). If there exist invertible linear operators A and B satisfying BE[mH(t)|G] = mG_p(t), where mG_p(t) is the component of the thinned intensity \u03bbG_p(t), and pB\u22121g(t; \u03b8) = g(t; A\u03b8), then\n\n(1) E[\u2207R(\u03b8)|G] \u2264 1/p A\u22121\u2207Rp(A\u03b8), if R is the LSE loss;\n(2) E[\u2207R(\u03b8)|G] \u2264 A\u22121\u2207Rp(A\u03b8), if R is the MLE loss.\n\nParticularly, if the intensity is deterministic, i.e., mH(t) = 1, both inequalities hold with equality.\nRemark. 
The thinned gradient can be transformed into an upper estimate of the full gradient in general, and into an unbiased estimate for deterministic intensities. More specifically, the gradient estimation is unbiased if and only if E[H(t; \u03b8)\u03bb(t; \u03b8)] = EH(t; \u03b8)E\u03bb(t; \u03b8), as shown in the proof of Theorem 5.1. Here H usually depends on the intensity function \u03bb(t; \u03b8); for instance, the MLE estimator has H(t; \u03b8) = \u2212\u2207 log \u03bb(t; \u03b8). The condition may not hold under such circumstances. For stochastic intensities, the thinned gradient may therefore be biased, yielding an estimation larger than the ground truth. Some empirical results on Hawkes processes are shown in Figure 3. The next theorem shows that the thinned gradient has a smaller variance than the sub-interval gradient.\nTheorem 5.2 (Variance comparison). Let \u2207 \u02dcRG(\u03b8) and \u2207R\u2113(\u03b8) be the p-thinned and sub-interval gradients at \u03b8, where \u2207 \u02dcRG(\u03b8) = 1/p A\u22121\u2207Rp(A\u03b8) for LSE and \u2207 \u02dcRG(\u03b8) = A\u22121\u2207Rp(A\u03b8) for MLE. The variance of the p-thinned gradient is no greater than that of the sub-interval gradient, i.e.,\n\nV[\u2207 \u02dcRG(\u03b8)] \u2264 V[\u2207R\u2113(\u03b8)].\n\nRemark. A Chebyshev error bound can be easily obtained, as a result of Theorems 5.1 and 5.2:\n\nP(|\u2207 \u02dcRG(\u03b8) \u2212 E\u2207 \u02dcRG(\u03b8)| > \u03f5) \u2264 V[\u2207 \u02dcRG(\u03b8)]/\u03f5^2 \u2264 V[\u2207R\u2113(\u03b8)]/\u03f5^2 \u2264 1/p V[\u2207R(\u03b8)]/\u03f5^2 + (1 \u2212 p)/p [E\u2207R(\u03b8)]^2/\u03f5^2,\n\nfor any \u03f5 > 0. Since \u2207R(\u03b8) is a martingale integral (Eq. 2), we have E\u2207R(\u03b8) \u2192 0 as the number of realizations increases. 
Hence, the left-hand side probability is bounded by O(\u03f5^\u22122 p^\u22121 V[\u2207R(\u03b8)]), which shows that the gradient estimation for deterministic intensities will not be far from the true one, if the number of realizations is sufficiently large. Unfortunately, the result does not apply to stochastic intensities. Nonetheless, its effectiveness on stochastic intensities is empirically validated on real datasets with Hawkes processes in our experiments (see Figure 4).\nThinning for stochastic optimization. We have shown that thinning can be used for estimating parameters and gradients with less data. This inspires us to employ it in stochastic optimization. We propose a novel Thinning-SGD (TSGD) method for learning a point process with a parametric intensity function, as shown in Algorithm 1. At each iteration, a thinned dataset is used for computing the gradient. Compared with the sub-interval gradient, the thinned gradient has a smaller variance, so that the convergence curve may have fewer fluctuations and find a path to the optimal solution faster. Thinning is also applicable to other gradient-based optimization algorithms such as Adam [16].\n\nAlgorithm 1: TSGD: Thinning Stochastic Gradient Descent\nInput: Event sequences {ti}, learning rate \u03b1, thinning size p, convergence criterion, the objective function of a parametric point process model R(\u03b8).\nOutput: Optimal parameter \u03b8\u2217.\n1 Initialize \u03b8;\n2 Find A according to Theorem 4.3;\n3 repeat\n4     Sample a p-thinning batch t\u2032i from one of the sequences ti;\n5     Compute the thinned gradient \u2207 \u02dcRG(\u03b8), where \u2207 \u02dcRG(\u03b8) is defined in Theorem 5.2;\n6     \u03b8 \u2190 \u03b8 \u2212 \u03b1\u2207 \u02dcRG(\u03b8);\n7 until Convergence criterion is satisfied;\n\n6 Related Work\n\nLearning of parametric point processes. Parametric point processes are the most conventional and popular method in the study of point processes. 
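A minimal runnable sketch of Algorithm 1 (our illustration, not the authors' code) for the simplest decouplable model: a homogeneous Poisson process with rate \u03b8 and MLE loss R(\u03b8) = \u03b8T \u2212 N log \u03b8. Here A reduces to the scalar p, and since E[Np] = pN over the thinning, (pT \u2212 Np/\u03b8)/p is an unbiased per-iteration estimate of the full gradient T \u2212 N/\u03b8. The learning rate, horizon, and true rate below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(2)
theta_true, T, p = 3.0, 200.0, 0.1   # true rate, horizon, thinning level (arbitrary)

# One observed realization of a homogeneous Poisson process on [0, T].
events = np.sort(rng.uniform(0.0, T, rng.poisson(theta_true * T)))

# TSGD: each iteration draws a fresh p-thinning batch and takes a gradient step.
# With E[N_p] = p * N over the thinning, (p*T - N_p/theta)/p is an unbiased
# estimate of the full MLE gradient T - N/theta.
theta, lr = 1.0, 1e-4
for _ in range(2000):
    n_p = int((rng.random(events.size) < p).sum())  # retained count N_p
    theta -= lr * (p * T - n_p / theta) / p

print(theta, events.size / T)  # TSGD estimate vs. full-data MLE N/T
```

For a decouplable model with a non-trivial A (e.g., the Hawkes example of Section 4), line 2 of Algorithm 1 would replace the scalar p by A = diag(p, 1).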
For example, [37] designs an algorithm, ADM4, for learning the parameter representing the hidden network of social influences. [19] parameterizes the infectivity parameter in Hawkes processes and employs the technique of ADMM for parameter estimation. [33] proposes a learning algorithm combining MLE with a sparse-group-lasso regularizer to learn the so-called \u201cGranger causality graph\u201d. All these models are decouplable; therefore thinning is applicable to the learning of them.\nLearning of non-parametric point processes. There has been an increasing amount of studies on non-parametric point processes and their learning algorithms in recent years. The isotonic Hawkes process [31] is an interesting and representative work among them, which combines isotonic regression and Hawkes processes. [1] proposes an algorithm to learn the infectivity matrix without any parametric modeling and estimation of the kernels. Another category of non-parametric models related to point processes is Bayesian non-parametric models, such as [4], [12] and [25]. Besides, some explorations of combining point processes and deep neural networks are emerging. Some typical works include [11], [26], and [20].\nAcceleration for the learning of point processes. [17] proposes a method of low-rank approximation of the kernel matrix for large-scale datasets. The online learning algorithm for Hawkes processes [34] discretizes the time axis into small intervals for learning the triggering kernels. [13] designs a hardware acceleration method for MLE of Hawkes processes. A recent work [32] introduces a stochastic optimization method for Hawkes processes. Unfortunately, none of the existing works considers thinning as a sampling method to reduce the time complexity.\nThinning for point processes. The thinning operation of point processes has been discussed mainly in the statistics community. 
Thinning was first used for the simulation of point processes [18, 27]. Some limit results have been proposed [14, 29, 5], among which the property of Cox process approximation is often mentioned [10]. However, most, if not all, of these asymptotic results investigate the behavior of a thinned process as the thinning level p \u2192 0, which does not serve our purpose.\n\n7 Experiments\n\nIn this section, we assess the performance of our proposed thinning sampling in three tasks: parameter estimation, gradient estimation, and stochastic optimization. All the experiments were conducted on a server with an Intel Xeon CPU E5-2680 (2.80GHz) and 250GB RAM.\n\nFigure 2: Parameter estimation on a 10-dimensional linear Hawkes process with LSE. (a): the RMSE of estimated parameters. (b): training time. (c): RMSE vs. thinning level p.\n\nParameter estimation. We conduct two experiments for this task on synthetic datasets. The first experiment is to test thinning on Hawkes processes. We simulate 100 sequences of 10-dimensional linear Hawkes processes and use different numbers of events for training. The longest sequence has around 14k events. The parameters of the process are randomly generated from a uniform distribution. For each dataset, we perform LSE with different histories: full data and p-thinned data with p = 0.2 and p = 0.5.\nThe results are shown in Figure 2. We can see that as the number of training events increases, the error (measured by RMSE) in parameter estimation decreases, at the cost of longer running time. 
A larger p value yields better estimations but also runs slower. When the number of events is large enough, the estimation with 0.2-thinning is as accurate as that with full data, but runs an order of magnitude faster. For a dataset of 14k events, 0.2-thinning only took 2 minutes, whereas the training on full data took 26.5 minutes, and the decrease of RMSE is less than 0.01. Figure 2 (c) shows that RMSE decreases as the thinning level p increases.\nThe second experiment is to test thinning for learning various state-of-the-art models: MMEL [35], Granger Causality for Hawkes [33] and Sparse Low-rank Hawkes [36]. We generate 30 sequences for each model and perform parameter estimation on different histories. The averages and standard deviations of the quality metric and training time are presented in Table 1. We use RMSE as the metric for MMEL and Granger Causality, and the accuracy of non-zero entries in the adjacency matrix for Sparse Low-rank Hawkes.\n\nTable 1: Parameter estimation on state-of-the-art models.\n\nModel | History | RMSE/Accuracy | Training time (s)\nMMEL [35] | Full | 0.0568 (0.0013) | 38.03 (4.19)\nMMEL [35] | Thinned (p=0.5) | 0.0569 (0.0012) | 8.68 (1.06)\nMMEL [35] | Thinned (p=0.2) | 0.0570 (0.0012) | 3.94 (0.47)\nGranger Causality for Hawkes [33] | Full | 0.0161 (0.0078) | 229.56 (17.87)\nGranger Causality for Hawkes [33] | Thinned (p=0.5) | 0.0163 (0.0022) | 65.68 (4.67)\nGranger Causality for Hawkes [33] | Thinned (p=0.2) | 0.0167 (0.0010) | 3.96 (1.80)\nSparse Low-rank Hawkes [36] | Full | 97.46% (0.0133) | 73.76 (42.24)\nSparse Low-rank Hawkes [36] | Thinned (p=0.5) | 97.60% (0.0166) | 27.45 (17.51)\nSparse Low-rank Hawkes [36] | Thinned (p=0.2) | 96.63% (0.0243) | 4.51 (2.65)\n\nFigure 3: Gradient estimation for an NHPP and a linear Hawkes process using MLE and LSE. X-axes represent the RMSE of the parameters, and Y-axes the l2-norm of the gradient at the corresponding parameters.\n\n
It can be seen that thinning significantly reduces the training time of all models without compromising much estimation quality.\nGradient estimation. We consider two types of point processes: a non-homogeneous Poisson process with deterministic intensity \u03bb(t; a, b, c, d) = a + b sin(ct + d), and a linear Hawkes process with H-intensity \u03bbH(t) = \u00b5 + \u03b1 \u2211_{ti<t} \u03d5(t \u2212 ti). The gradient at different values of the parameters is computed and depicted in Figure 3.\nThe result shows three facts. First, every line in the figure touches the X-axis at the origin, except for NHPP(c) (non-differentiable). This phenomenon demonstrates that thinning sampling yields asymptotically unbiased parameter estimation, for both LSE and MLE. Second, we can see that the red and blue lines in the first 6 sub-figures overlap significantly, which confirms that thinning gives unbiased gradient estimation for deterministic intensities. Third, in the last two sub-figures, the blue lines tend to be on or above the red ones, which demonstrates that thinning makes the gradient estimation larger than or equal to the ground truth for stochastic intensities.\nStochastic optimization. We test thinning sampling for stochastic optimization algorithms, including SGD and Adam. The task is to learn a linear Hawkes process. We test thinning (p=0.1), sub-interval sampling (SubInt), and the stochastic optimization learning algorithm (StoOpt) proposed in [32], combined with SGD, Adam and the typical gradient descent (GD). We test on 4 datasets:\n\u2022 Synthetic dataset: We simulate 10 realizations of a 5-dimensional linear Hawkes process, with parameters generated from a uniform distribution. The dataset contains 20k events. 
We train the model using the entire dataset, and the RMSE between the estimated parameters and the ground truth is reported as the test error.
• IPTV dataset [24]: The dataset consists of IPTV viewing events, recording the timestamps at which users watched a video and the category the video belongs to. Each user is treated as a realization and each category as a dimension. We select 7 and 3 realizations with 22k and 9k events as the training and test datasets, respectively. The number of categories is 16.
• NYC taxi dataset: The data is from The New York City Taxi and Limousine Commission¹, which records fields capturing the pick-up time, location and payment information of green taxis' trips. We select the trips starting from the Manhattan district in the first 10 days of January 2018 and use the 14 areas as dimensions. The training and test datasets contain 60k and 12k events, respectively.
• Weeplace dataset [23]: This dataset contains the check-in histories of users at different locations. The categories of events include food, education, outdoors, shops, and 10 others. The check-in histories of 46 and 10 users are selected as the training and test datasets, respectively. The sizes of the datasets are 50k and 11k.

[Figure 3 panels: NHPP(a, b), NHPP(c), NHPP(d) and Hawkes, each under MLE and LSE; curves for full data and thinning (p = 0.25).]

Figure 4: The average convergence curves of different learning algorithms on different datasets.

We ran each method on each dataset 10 times. Figure 4 presents the average convergence curves of each method on the different datasets.
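The thinning-based stochastic optimization compared in these runs boils down to: at every iteration, draw a fresh p-thinned copy of the history, form a gradient estimate with the event-sum term rescaled by 1/p, and take a step. Below is a toy Python sketch for the simplest possible case, fitting the rate μ of a homogeneous Poisson process on [0, T]. It illustrates the sampling scheme only; the function name and setup are ours, and the paper's TSGD handles general decouplable intensities:

```python
import random

def thinned_sgd_rate(events, T, p=0.1, lr=0.01, iters=2000, seed=0):
    """Toy thinned SGD: fit the rate mu of a homogeneous Poisson
    process observed on [0, T], whose log-likelihood is
        ll(mu) = n * log(mu) - mu * T.
    Each iteration re-thins the history (keep probability p); the
    thinned event count is rescaled by 1/p so the gradient estimate
    targets the full-data gradient n/mu - T.
    """
    rng = random.Random(seed)
    mu = 1.0
    for _ in range(iters):
        n_thin = sum(rng.random() < p for _ in events)
        grad = n_thin / (p * mu) - T      # unbiased estimate of n/mu - T
        mu = max(mu + lr * grad, 1e-6)    # gradient ascent, keep mu > 0
    return mu
```

With, say, 50 events on [0, 10], the iterates hover around the MLE n/T = 5. Re-thinning independently at every step is the feature the paper's variance result suggests makes this scheme more stable than sub-interval sampling.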
Training with GD failed to finish even the first iteration within the maximum time shown in Figure 4 for each dataset, and thus its results are not presented. From the learning curves, we can see that Thinning+Adam outperforms all competitors in terms of test error on all datasets. Within the SGD group alone, Thinning also achieves the lowest test error. From the bottom row, we see that Thinning+Adam tends to have smoother learning curves. On the Weeplace and NYC taxi datasets in particular, the fluctuations of StoOpt and SubInt are dramatic. This is because thinning sampling better captures the information of the whole timeline, whereas the other methods are prone to a zigzagging search path.

8 Conclusion & Discussion

In this paper, we discussed thinning as a downsampling method for point processes. The thinning operation uniformly compresses the intensity along the time axis while completely preserving its structure. In this way, similar parameter estimation performance can be achieved with less input data, as shown in the experiments. We also demonstrated how to estimate the gradient on a thinned history, which leads to a novel stochastic optimization algorithm, called TSGD. Experimental results show that TSGD converges faster and has a learning curve with fewer fluctuations, which is explained by the theorem that the thinned gradient estimator has a smaller variance.

In future work, it would be interesting to study other sampling methods, such as Jackknife resampling, for point processes. This work focuses on point processes with decouplable intensities.
It would also be interesting to explore broader assumptions that cover more scenarios.

¹ https://www1.nyc.gov/site/tlc/index.page

[Figure 4 panels: test error (top row, RMSE for the synthetic dataset and NegLogLik otherwise) and training error (bottom row, NegLogLik) against elapsed time on the Synthetic, IPTV, NYC Taxi and Weeplace datasets, for Thinning, StoOpt and SubInt, each combined with Adam and SGD.]

Acknowledgment

This work is partially supported by the Data Science and Artificial Intelligence Research Centre (DSAIR) and the School of Computer Science and Engineering at Nanyang Technological University.

References

[1] Massil Achab, Emmanuel Bacry, Stéphane Gaïffas, Iacopo Mastromatteo, and Jean-François Muzy. Uncovering causality from multivariate Hawkes integrated cumulants. The Journal of Machine Learning Research, 18(1):6998–7025, 2017.

[2] Per K Andersen, Ornulf Borgan, Richard D Gill, and Niels Keiding. Statistical Models Based on Counting Processes. Springer Science & Business Media, 2012.

[3] Emmanuel Bacry, Stéphane Gaïffas, and Jean-François Muzy. A generalization error bound for sparse and low-rank multivariate Hawkes processes. arXiv preprint arXiv:1501.00725, 2015.

[4] Charles Blundell, Jeff Beck, and Katherine A Heller. Modelling reciprocating relationships with Hawkes processes. In Advances in Neural Information Processing Systems, pages 2600–2608, 2012.

[5] Fred Böker. Convergence of thinning processes using compensators.
Stochastic Processes and their Applications, 23(1):143–152, 1986.

[6] Clive G Bowsher. Modelling security market events in continuous time: Intensity based, multivariate point process models. Journal of Econometrics, 141(2):876–912, 2007.

[7] Pierre Brémaud. Point Processes and Queues: Martingale Dynamics, volume 50. Springer, 1981.

[8] David R Brillinger. The identification of point process systems. The Annals of Probability, 3(6):909–924, 1975.

[9] Daryl J Daley and David Vere-Jones. An Introduction to the Theory of Point Processes: Volume I: Elementary Theory and Methods. Springer Science & Business Media, 2002.

[10] Daryl J Daley and David Vere-Jones. An Introduction to the Theory of Point Processes: Volume II: General Theory and Structure. Springer Science & Business Media, 2007.

[11] Nan Du, Hanjun Dai, Rakshit Trivedi, Utkarsh Upadhyay, Manuel Gomez-Rodriguez, and Le Song. Recurrent marked temporal point processes: Embedding event history to vector. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1555–1564. ACM, 2016.

[12] Nan Du, Mehrdad Farajtabar, Amr Ahmed, Alexander J Smola, and Le Song. Dirichlet-Hawkes processes with applications to clustering continuous-time document streams. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 219–228. ACM, 2015.

[13] Ce Guo and Wayne Luk. Accelerating maximum likelihood estimation for Hawkes point processes. In 2013 23rd International Conference on Field Programmable Logic and Applications, pages 1–6. IEEE, 2013.

[14] Olav Kallenberg. Limits of compound and thinned point processes. Journal of Applied Probability, 12(2):269–278, 1975.

[15] Alan Karr. Point Processes and Their Statistical Inference, volume 7. CRC Press, 1991.

[16] Diederik P Kingma and Jimmy Ba.
Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[17] Rémi Lemonnier, Kevin Scaman, and Argyris Kalogeratos. Multivariate Hawkes processes for large-scale inference. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.

[18] P. A. W. Lewis and Gerald S Shedler. Simulation of nonhomogeneous Poisson processes by thinning. Naval Research Logistics Quarterly, 26(3):403–413, 1979.

[19] Liangda Li and Hongyuan Zha. Learning parametric models for social infectivity in multi-dimensional Hawkes processes. In Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014.

[20] Shuang Li, Shuai Xiao, Shixiang Zhu, Nan Du, Yao Xie, and Le Song. Learning temporal point processes via reinforcement learning. In Advances in Neural Information Processing Systems, pages 10781–10791, 2018.

[21] Tianbo Li, Pengfei Wei, and Yiping Ke. Transfer Hawkes processes with content information. In 2018 IEEE International Conference on Data Mining (ICDM), pages 1116–1121. IEEE, 2018.

[22] Scott Linderman and Ryan Adams. Discovering latent network structure in point process data. In International Conference on Machine Learning, pages 1413–1421, 2014.

[23] Bin Liu, Yanjie Fu, Zijun Yao, and Hui Xiong. Learning geographical preferences for point-of-interest recommendation. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1043–1051. ACM, 2013.

[24] Dixin Luo, Hongteng Xu, Yi Zhen, Xia Ning, Hongyuan Zha, Xiaokang Yang, and Wenjun Zhang. Multi-task multi-dimensional Hawkes processes for modeling event sequences. In Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.

[25] Charalampos Mavroforakis, Isabel Valera, and Manuel Gomez-Rodriguez. Modeling the dynamics of learning activity on the web.
In Proceedings of the 26th International Conference on World Wide Web, pages 1421–1430. International World Wide Web Conferences Steering Committee, 2017.

[26] Hongyuan Mei and Jason M Eisner. The neural Hawkes process: A neurally self-modulating multivariate point process. In Advances in Neural Information Processing Systems, pages 6754–6764, 2017.

[27] Yosihiko Ogata. On Lewis' simulation method for point processes. IEEE Transactions on Information Theory, 27(1):23–31, 1981.

[28] Yosihiko Ogata. Statistical models for earthquake occurrences and residual analysis for point processes. Journal of the American Statistical Association, 83(401):9–27, 1988.

[29] Richard Serfozo. Thinning of cluster processes: Convergence of sums of thinned point processes. Mathematics of Operations Research, 9(4):522–533, 1984.

[30] Long Tran, Mehrdad Farajtabar, Le Song, and Hongyuan Zha. NetCodec: Community detection from individual activities. In Proceedings of the 2015 SIAM International Conference on Data Mining, pages 91–99. SIAM, 2015.

[31] Yichen Wang, Bo Xie, Nan Du, and Le Song. Isotonic Hawkes processes. In International Conference on Machine Learning, pages 2226–2234, 2016.

[32] Hongteng Xu, Xu Chen, and Lawrence Carin. Superposition-assisted stochastic optimization for Hawkes processes. arXiv preprint arXiv:1802.04725, 2018.

[33] Hongteng Xu, Mehrdad Farajtabar, and Hongyuan Zha. Learning Granger causality for Hawkes processes. In International Conference on Machine Learning, pages 1717–1726, 2016.

[34] Yingxiang Yang, Jalal Etesami, Niao He, and Negar Kiyavash. Online learning for multivariate Hawkes processes. In Advances in Neural Information Processing Systems, pages 4937–4946, 2017.

[35] Ke Zhou, Hongyuan Zha, and Le Song. Learning social infectivity in sparse low-rank networks using multi-dimensional Hawkes processes.
In Artificial Intelligence and Statistics, pages 641–649, 2013.

[37] Ke Zhou, Hongyuan Zha, and Le Song. Learning triggering kernels for multi-dimensional Hawkes processes. In International Conference on Machine Learning, pages 1301–1309, 2013.