Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
The paper proposes a neural network model for temporal point processes. The model uses a neural network rather than a parametric form to model the conditional intensity function. To avoid having to integrate the intensity function when computing likelihoods, the model estimates the cumulative intensity function, which can easily be differentiated to recover the intensity. Experimental results on synthetic and real data sets show superior performance compared to the state of the art. I believe this paper has the potential to be very impactful. The proposed model is clever and effective, yet simple to understand and implement. My main reservation is that the evaluation is limited to likelihood-based measures (log-likelihood scores and intensity function curves). The paper would be even stronger if it reported metrics tailored to concrete prediction tasks (e.g. time prediction error). Minor: L95: I understand that training with long sequences can be problematic due to vanishing/exploding gradients, but the word "intractable" gives the impression that the problem is computational complexity. ====================================================== I thank the authors for their response and for conducting prediction experiments.
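The cumulative-intensity idea summarized in the review above can be illustrated with a minimal sketch in plain Python. This is not the authors' network: `cumulative_hazard` is a hypothetical, history-independent stand-in with hand-picked positive weights (`WEIGHTS`, `SLOPES`, `BIASES` are all assumptions), and its derivative is written out by hand instead of being obtained by automatic differentiation. The sketch only shows that once the cumulative intensity Phi(tau) is available, the log-likelihood, sum of log lambda(tau_i) - Phi(tau_i) with lambda = dPhi/dtau, requires no numerical integration.

```python
import math

# Hypothetical toy parameters (assumed for illustration only). Positive
# weights and slopes keep the cumulative hazard increasing in tau.
WEIGHTS = [0.5, 1.5]
SLOPES = [1.0, 0.3]
BIASES = [0.0, -1.0]


def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sigmoid(x):
    # Stable logistic function; it is the derivative of softplus.
    return 0.5 * (1.0 + math.tanh(0.5 * x))


def cumulative_hazard(tau):
    # Phi(tau) = sum_k w_k * (softplus(a_k * tau + b_k) - softplus(b_k)),
    # shifted so that Phi(0) = 0.
    return sum(w * (softplus(a * tau + b) - softplus(b))
               for w, a, b in zip(WEIGHTS, SLOPES, BIASES))


def intensity(tau):
    # lambda(tau) = dPhi/dtau, differentiated by hand for this toy form.
    return sum(w * a * sigmoid(a * tau + b)
               for w, a, b in zip(WEIGHTS, SLOPES, BIASES))


def log_likelihood(intervals):
    # Each inter-event interval contributes log lambda(tau) - Phi(tau).
    return sum(math.log(intensity(t)) - cumulative_hazard(t)
               for t in intervals)
```

In a real implementation the hand-written `intensity` would presumably be replaced by a gradient call from an autodiff framework, and Phi would also depend on the event history through a recurrent hidden state.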
In general, I liked this approach. It is a new and interesting take on the problem and one that seems obvious in retrospect (which is often a sign of a good idea). I was happy to read the paper and feel that the idea should be communicated to the field as a whole. I am concerned, however, that the paper fails to give the CT-LSTM model its full due. The introduction states that the hazard (or intensity) functions of previous work are either constant or have a rather fixed form (such as an exponential asymptote). CT-LSTM is a notable exception to this. The hidden state does decay exponentially, but it is multi-valued (i.e. a vector) and the intensity is a non-linear function of the hidden state, and can therefore exhibit more complex behavior. While the introduction, as written, is true, it does not acknowledge this fact. Further, in Section 3 (Related works), this work is set to the side, with the statement that it performs very similarly to the RNN model of Du et al. This may or may not be true (my own experience is more mixed), but it is notable that the experiments do not compare to this one other method that could produce more complex intensity functions. If the authors had backed up such a statement (that CT-LSTM does not do as well) with experimental results showing it, this paper would be *significantly* stronger. As it is, I am left wondering whether the other non-exponentially decaying method (CT-LSTM) would do as well as the method proposed in this paper. Secondly, I am concerned about the training procedure for the "exponential" model in the experimental results. This uses the intensity function of Equation 6, which can exactly model a Hawkes process with a single exponential kernel. Even for a Hawkes process whose kernel is a mixture of two exponentials, the single exponential can often do reasonably well.
Yet these results are worse than those for the piecewise constant model, and Figure 3 suggests that the parameters of the exponential model have not been fit properly (although this is hard to tell, as this might be a test example). That the exponential model *can* fit these models well does not, of course, mean that it will in practice. However, if this comes down to the training/fitting procedure and whether one method is more robust than another, we need more detailed experimental results demonstrating this. As it is, I am left worried that the authors did not try "hard enough" to get the competing models to fit.
The proposed model mainly focuses on modeling the integral of the intensity function. The paper is well written and easy to understand. The proposed model is mainly based on monotonic networks, and the technical novelty is somewhat incremental, as it is more like an application of monotonic networks to point processes. As for the experimental evaluation, the paper only shows improvements in log-likelihood, while a major application of point process models is prediction, such as event prediction and time prediction (see the experiments in Du et al., 2016). The intensity function is important for making such predictions; however, it is not clear to me how to use the proposed network to derive the intensity function and make predictions, and this aspect is also not evaluated in the experiments. ----------------------------------------------------- The authors' response addresses my concerns, and I have changed my initial score.
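On the question of how a cumulative-intensity model could yield an intensity and a time prediction, one possible route is sketched below in plain Python. It is an illustration under stated assumptions, not necessarily the procedure used in the paper: `intensity_from_cumulative` uses a central finite difference as a stand-in for automatic differentiation, `median_interval` predicts the median inter-event time by bisection on Phi(tau) = log 2 (which follows from the survival function exp(-Phi(tau))), and `toy_cum_hazard` is a hypothetical constant-rate example chosen because its median is known in closed form.

```python
import math


def intensity_from_cumulative(cum_hazard, tau, eps=1e-5):
    # Central finite difference, standing in for automatic differentiation.
    return (cum_hazard(tau + eps) - cum_hazard(tau - eps)) / (2.0 * eps)


def median_interval(cum_hazard, max_upper=1e12, iterations=100):
    # The survival function is exp(-Phi(tau)), so the median interval
    # solves Phi(tau) = log 2.
    target = math.log(2.0)
    lo, hi = 0.0, 1.0
    # Grow the bracket until it contains the median.
    while cum_hazard(hi) < target:
        hi *= 2.0
        if hi > max_upper:
            raise ValueError("cumulative hazard never reaches log 2")
    # Bisection is valid because Phi is monotonically increasing.
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if cum_hazard(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def toy_cum_hazard(tau):
    # Hypothetical example: constant intensity 2.0, so Phi(tau) = 2 * tau.
    return 2.0 * tau
```

For the constant-rate toy example the median interval is log(2) / 2, which gives a simple sanity check; a time prediction error could then be computed by comparing such predictions with held-out intervals.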