Reviews: Nonparametric Regressive Point Processes Based on Conditional Gaussian Processes

The paper considers how to create a flexible method for modelling Hawkes-like processes with a flexible triggering kernel, using Gaussian processes that work on a different input than is usually attempted. Sections 1-3 are well written and very clear, and give a good motivation for, and description of, how to get a GP regressive point process.
The idea in the preliminary work that the likelihood of a Hawkes process factorises across different event types is surprising, as the triggering kernel in the CIF appears to connect $\lambda_{u_{i}}(t)$ and $\lambda_{u_{j}}(t)$: $u_{i}$ and $u_{j}$ appear to be part of a single triggering kernel, and hence are connected. Can the authors clarify whether there is some conditional independence I am not spotting here? It doesn't appear to be of significance, however, as the rest of the paper only focusses on a single event type in the end.
The idea of introducing an additional kernel that works on augmented data, in the form of the evaluation of an indicator function, is a clever one that skirts the issue that certain contributions would otherwise be undefined. I believe such a kernel could be used in other applications and, as far as I am aware, it is a novel and useful contribution.
Section 4, the most significant methodological section, lacks clear motivation, however. It isn't clearly discussed why a conditional distribution approach is needed in the first place. Why do we need such a conditional GP model instead of using the kernel defined in Section 3 and following Lloyd's derivation from there? It is indicated that this is to 'encode the dependencies of the CIF on the past events', but I think this could really do with clarifying. Perhaps the authors could expand on this in both their rebuttal and any revision of the manuscript?
In Section 4, it also appears odd to introduce additional noise into $f_{Z}$ without justification, rather than simply using equation (4). I presume this is because it is intractable in (4), whereas the additional noise makes it tractable in (6). However, it is not immediately clear why this is done, or what the resulting implications are. For example, the draws from $f_{Z}$ are now 'rough' rather than smooth, and not differentiable, which doesn't seem like desirable behaviour. Is $S_{\epsilon}$ learnt to be very small, or is it reasonably large? Presumably this would affect the quality of the approximation. Can the authors clarify why this is done and illustrate the resulting implications for the approximation being made?
The synthetic experiments illustrate that the triggering kernel can be learnt well non-parametrically, and a brief discussion illustrates that the model is able to capture triggering kernels that contain both excitation and inhibition, a property most Hawkes process approximations don't have. I appreciate the authors' effort in doing a relatively in-depth study of the model's properties here rather than just providing results on a bunch of experiments.
For the IPTV data, it's not clear why CGPRPP should not be able to model bursty events. If it's non-parametric, shouldn't it adapt to this situation given enough inducing points? Given CGPRPP's apparent flexibility but (slightly) worse performance, some additional digging into the reasons would be useful if the answer isn't clear. Is the model overfitting? Does it need additional kernel contributions for flexibility, or periodic components? Do the inducing inputs need to be placed more strategically? Are simply more of them needed?
For the MIMIC data there is a nice illustration for 331 showing inhibition that GP-GS can't model, but there is no explanation of what goes *wrong* in the classes that CGPRPP seems to struggle with. I think this would be enlightening. It is ok to not always give ground-breaking improvements in predictive performance, but it is useful to know *why*.
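To make the earlier concern about the added noise in $f_{Z}$ concrete, the following is a minimal sketch and not the paper's model. It assumes a squared-exponential kernel on a 1-D grid and a hypothetical noise variance `noise_var` standing in for $S_{\epsilon}$, and it compares the mean squared increment of a noise-free GP draw with that of the same draw plus independent Gaussian noise.

```python
import numpy as np


def rbf_kernel(x, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel matrix on a 1-D grid (an assumed kernel)."""
    diff = x[:, None] - x[None, :]
    return variance * np.exp(-0.5 * (diff / lengthscale) ** 2)


def roughness(path):
    """Mean squared increment between neighbouring grid points."""
    return float(np.mean(np.diff(path) ** 2))


rng = np.random.default_rng(0)
x = np.linspace(0.0, 5.0, 200)

# Small jitter is added for numerical stability of the Cholesky factor only.
K = rbf_kernel(x) + 1e-6 * np.eye(x.size)
L = np.linalg.cholesky(K)

# Hypothetical value standing in for S_epsilon; not taken from the paper.
noise_var = 0.01

# Noise-free draw from the GP prior on the grid.
smooth = L @ rng.standard_normal(x.size)
# The same draw with independent Gaussian noise added at every grid point.
noisy = smooth + np.sqrt(noise_var) * rng.standard_normal(x.size)

print("roughness of smooth draw:", roughness(smooth))
print("roughness of noisy draw: ", roughness(noisy))
```

Under these assumptions, the expected squared increment contributed by the noise is `2 * noise_var` regardless of how fine the grid is, whereas the increments of the noise-free draw shrink as the grid is refined. How much this matters in practice depends on the learnt magnitude of $S_{\epsilon}$.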
I think what is sorely missing, however, is a comparison to at least one method that is more similar and not parametric, to show that the proposed method is competitive. For example, [Rousseau et al.], [Xie et al.], etc. It seems unfair to compare only to relatively inflexible models when more flexible ones already exist.
Minor comments:
167. It should be made clear in the main text what 'some restrictions' are.
238. The text says HP-GS and CGPRPP perform the same, but that doesn't seem quite true; in fact, CGPRPP marginally outperforms HP-GS here.
[Efficient Non-parametric Bayesian Hawkes Processes. R Zhang, C Walder, MA Rizoiu, L Xie.]
[Nonparametric Bayesian estimation of multivariate Hawkes processes. S Donnet, V Rivoirard, J Rousseau.]

Originality: The work is original in that it proposes a GP for modeling point processes (PP) in which the GP kernel is a function of inter-arrival times instead of the absolute times of the events in the sequence. The authors further develop a methodology to learn the GP efficiently, based on latent points.
Clarity: The paper is well written and mostly easy to follow, apart from some discrepancies noted below under improvements. Most of the differences from existing work, and the novelty of the paper, are well laid out. Still, the paper puts a huge burden on the reader, who has to go through a massive amount of supplementary material to fully understand how the learning procedure works; since this is the main contribution, its derivation should at least be explained intuitively. The experiments section explains the parameters used for the competing methods and the proposed method clearly enough that the reader can infer them by looking at the code provided.

Paper ID:	630
Title:	Nonparametric Regressive Point Processes Based on Conditional Gaussian Processes
