Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
The paper deals with parametric point processes on the real line. The authors show that thinning a point process, i.e., keeping each point at random with a given probability $p$, compresses the intensity but preserves its structure. Hence, it provides a downsampling method. The method seems to be new, even if it is not a major breakthrough. It is more elegant than the various techniques that paste together sub-interval downsamples, and the proofs given in the paper are quite general. However, the paper lacks an evaluation of the uncertainty of the estimate. The paper is clearly written, even if it is sometimes unnecessarily abstract (see, e.g., Definition 2.2 of the stochastic intensity). By way of example, the theoretical results are applied to two particular parametric cases: non-homogeneous Poisson point processes and Hawkes processes. This is a good idea because it helps the reader understand the general theoretical results and see their possible uses.
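As a reader's illustration of the thinning property summarized above, here is a minimal sketch for the simplest case, a homogeneous Poisson process. It is not the authors' code: the function names, the rate, the horizon, and the independent retention of each point are all assumptions made for this example. Under those assumptions the thinned sequence has a rate of about $p$ times the original, so dividing the thinned estimate by $p$ recovers the original rate.

```python
import random


def simulate_homogeneous_poisson(rate, horizon, rng):
    """Event times of a homogeneous Poisson process on [0, horizon)."""
    times = []
    t = rng.expovariate(rate)  # inter-arrival times are Exp(rate)
    while t < horizon:
        times.append(t)
        t += rng.expovariate(rate)
    return times


def thin(times, p, rng):
    """Keep each event independently with probability p (assumed scheme)."""
    return [t for t in times if rng.random() < p]


rng = random.Random(0)
rate, horizon, p = 5.0, 2000.0, 0.2  # hypothetical values for illustration

events = simulate_homogeneous_poisson(rate, horizon, rng)
thinned = thin(events, p, rng)

rate_full = len(events) / horizon      # estimate from the full sequence
rate_thinned = len(thinned) / horizon  # roughly p * rate
rate_recovered = rate_thinned / p      # rescaled back to the original rate

print(len(events), len(thinned))
print(rate_full, rate_thinned, rate_recovered)
```

The thinned sequence is about five times shorter here, yet the rescaled estimate stays close to the full-sequence estimate. The sketch says nothing about the uncertainty of that estimate, which is the gap the review points out.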
The paper rigorously tackles the question of how best to learn point process parameters from very long sequences sampled from the point process. There are two ways of training the models: using sub-intervals or, as recommended in this paper, thinning the original sequences to make them shorter. The authors first rigorously establish the classes of models for which thinning can work, show that the variance of the gradient of the residue calculated for the thinned sequence is smaller than that calculated over the sub-interval sequence, and show experimentally that thinning manages to learn the state-of-the-art models. The paper is very well written; each lemma, theorem, and definition is followed by a clear example and explanation. The presentation is also smooth, and the paper is approachable while remaining rigorous. However, a few parts of the paper could do with more explanation. Firstly, the authors hint in lines 186-190 that the gradient for stochastic intensities is potentially unbiased. An example here and a discussion of the potential limitations would help contextualize the contribution better. Also, there seems to be a rather sudden jump from learning/modelling one intensity function to multivariate intensity functions between Definition 4.2 and Experiments. Overall, I believe that the paper makes a significant contribution and should be accepted for publication.
Minor:
- Line 66: Missing period.
- Eq (2): R(.) is undefined.
- Line 196: "of applying"
- Missing Theorem (reference) in Supplementary, above eqn. 9.
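To make the two downsampling schemes contrasted in this review concrete, the following sketch applies both to one simulated sequence. It is an illustration only, not the paper's experimental setup: the linear-ramp intensity, the horizon, the retention probability, and all function names are hypothetical, and the sequence is generated with the standard accept/reject simulation of a non-homogeneous Poisson process. It does not reproduce the paper's gradient-variance result; it only shows that a sub-interval sees a slice of the time axis while a thinned sequence covers the whole horizon.

```python
import random


def simulate_inhomogeneous_poisson(intensity, max_intensity, horizon, rng):
    """Accept/reject simulation: propose events at rate max_intensity and
    accept each one with probability intensity(t) / max_intensity."""
    times = []
    t = rng.expovariate(max_intensity)
    while t < horizon:
        if rng.random() < intensity(t) / max_intensity:
            times.append(t)
        t += rng.expovariate(max_intensity)
    return times


def subinterval(times, start, end):
    """Downsample by keeping only the events inside [start, end)."""
    return [t for t in times if start <= t < end]


def thin(times, p, rng):
    """Downsample by keeping each event independently with probability p."""
    return [t for t in times if rng.random() < p]


rng = random.Random(1)
horizon, p = 1000.0, 0.25  # hypothetical values for illustration


def intensity(t):
    # Hypothetical intensity: linear ramp from 1 to 5 over the horizon.
    return 1.0 + 4.0 * t / horizon


events = simulate_inhomogeneous_poisson(intensity, 5.0, horizon, rng)
sub = subinterval(events, 0.0, p * horizon)  # first quarter of the time axis
thinned = thin(events, p, rng)               # a quarter of the events, on average

print("sub-interval:", len(sub), "events, last at", max(sub))
print("thinned:     ", len(thinned), "events, last at", max(thinned))
```

In this toy setting the sub-interval only observes the low-intensity start of the ramp, whereas the thinned sequence retains events from the entire horizon at a reduced rate.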
The idea of learning point processes via thinning is interesting. The paper is well written. The only concern I have is about the applicability of the proposed model. In the real-world experiments, only the task of learning a Hawkes process is discussed. However, the Hawkes process is a weak baseline, and many other point process models have been shown to perform better than Hawkes processes on the IPTV and taxi data. It would improve the paper if these models were compared and discussed.
--------------
Thanks to the authors for their response, which addressed my concerns. I changed the score accordingly.
This paper presents a unique approach to computationally efficient parameter estimation in point process models using thinning properties. The problem is well motivated, and the background should be clear to those with some familiarity with point process theory. The derivations are thorough, and the experiments validate the claims of computational efficiency and accuracy in the case of deterministic intensities. Overall, the goal of accelerating inference for point process models is a practically relevant one for the ML community, and this paper opens the door to further valuable research on computationally efficient, low-variance gradient estimators for more complicated point process models. Though overall this is a good paper, I recommend the following improvements: First, the paper will likely not be digestible to those without significant prior knowledge of point process theory. To broaden the potential audience, I recommend adding some additional background on point processes (perhaps in an appendix; the background in Section 2 is quite dense). Second, Theorem 4.3 could be clearer. The assumption of decouplable intensity functions is clear, but an explanation of how restrictive the assumptions on A and B are in practice would be useful (perhaps add some examples where the assumptions are not satisfied). Third, I think the analysis in Section 5 of the bias of the gradient estimators is lacking. The claim in lines 188-189 seems unsubstantiated ("a larger gradient is not a bad thing..."). Please expand upon why we shouldn't be worried about the biases of your estimators when the intensity is not deterministic.