__ Summary and Contributions__: This paper proposes a variant of noise-contrastive estimation (NCE) that reduces the computational cost of learning multivariate point processes, and provides theoretical guarantees for it. The authors evaluate their method on both synthetic and real-world datasets and show that it achieves results comparable to the baselines in much less computational time. However, the assumptions made in the theoretical part seem to conflict with the experimental setup.

__ Strengths__: Applying NCE to make the learning of point processes scalable is a very good idea. Moreover, the authors provide theoretical support for the soundness of the proposed learning strategy, which strengthens the proposed method. The proof seems correct.

__ Weaknesses__: The main concern is the experimental part. Although the training/testing likelihood is reasonable for evaluating the convergence and the performance of the proposed method, I would like to see more comparisons on predictive tasks in real-world data sets.
Additionally, Assumption 1 in the paper may be questionable in some situations. Continuity is a strong assumption on the intensity function, and it makes the proposed theory inapplicable to many widely used point processes, e.g., Hawkes processes and self-correcting processes, whose intensities are not continuous. Because the authors apply some complicated point process models, e.g., the neural Hawkes process, and achieve encouraging performance, this assumption may be redundant or could be relaxed. In particular, I wonder whether the assumption of Riemann integrability can be replaced with Lebesgue integrability.
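To make the continuity concern concrete, here is a minimal sketch of a univariate Hawkes intensity with an exponential kernel. The kernel form, the parameter values (`mu`, `alpha`, `beta`), and the event times are illustrative assumptions of mine, not taken from the paper. The intensity jumps by `alpha` at each event time, so it is not continuous there.

```python
import math

def hawkes_intensity(t, events, mu=0.5, alpha=0.8, beta=1.0):
    """Exponential-kernel Hawkes intensity at time t.

    Only events strictly before t contribute, so the value at an
    event time equals the left limit of the intensity.
    All parameter values are illustrative assumptions.
    """
    return mu + sum(alpha * math.exp(-beta * (t - s)) for s in events if s < t)

events = [1.0, 2.5]   # hypothetical event times
eps = 1e-9

left = hawkes_intensity(1.0, events)         # event at t=1.0 not yet counted
right = hawkes_intensity(1.0 + eps, events)  # just after the event
jump = right - left                          # approximately alpha

print(f"left limit:  {left:.6f}")
print(f"right limit: {right:.6f}")
print(f"jump size:   {jump:.6f}")
```

The intensity remains piecewise continuous with finitely many jumps on any bounded interval, which is why a relaxed integrability condition seems plausible to me.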
Overall, I think it is nice work, but the conflict between the assumption and the experimental settings prevents me from accepting this work directly.
Minor issues: The font size of the text in the figures should be enlarged. The information for the last reference (Xu et al. 2018) is wrong. It was published at IJCAI.

__ Correctness__: Yes.

__ Clarity__: Yes.

__ Relation to Prior Work__: Yes.

__ Reproducibility__: Yes

__ Additional Feedback__: After the rebuttal I tend to accept this paper. My concerns have been addressed, and I am satisfied with the addition of the new experiments to the revised version.

__ Summary and Contributions__: The paper proposes a novel noise-contrastive estimation method for multivariate point processes. The authors evaluate their method on both synthetic and real-world datasets, and show that it takes much less wall-clock time while still achieving competitive log-likelihood.

__ Strengths__: The idea of applying NCE to point processes is interesting, even though this is not the first time it has been proposed. The research question of finding an efficient estimator is relevant to the NeurIPS community. The paper is well-written and clear.

__ Weaknesses__: The NCE estimator for MPP looks interesting. However, the paper suffers from a number of flaws that should be better addressed.
1. The paper proposes an NCE estimator for MPP. However, this is not the first attempt to apply NCE to point processes. The INITIATOR model (Guo et al., 2018) has already attempted to do so. I believe the extension from univariate point processes to multivariate ones should not be considered a significant contribution.
2. It is questionable whether the advantages of NCE carry over to point processes. The main benefit of NCE is to reduce the computational cost of MLE. However, the proposed method involves a sampling procedure, which is usually time-consuming. The authors also fail to consider and compare with other existing estimators for point processes that are more efficient than MLE. For example, the least-squares estimator (which has been integrated into the Python library “tick” for learning point processes) even has a closed-form solution for learning linear multivariate Hawkes processes. Further, the broader category of martingale estimators, to which the LSE belongs, also possesses the desired properties of consistency and asymptotic normality. These commonly used methods should also be mentioned and discussed.
3. The theoretical properties seem to be inherited from the NCE, rather than being derived from the proposed incorporation.
4. The empirical evaluation is weak. The paper only involves one baseline (NHP) with MLE as the underlying ground truth. More baselines should be considered, such as parametric point processes (vanilla Hawkes processes), the recurrent marked point processes, etc.

__ Correctness__: The claims and method seem to be correct.

__ Clarity__: The paper is well written and clear.

__ Relation to Prior Work__: The paper misses quite a few methods that should be taken into account: the least-squares estimator for point processes, the recurrent marked point processes, as well as many parametric point process models.

__ Reproducibility__: Yes

__ Additional Feedback__: Please see above.

__ Summary and Contributions__: This paper proposes a noise-contrastive estimation method for point processes that is expected to be computationally efficient. The authors also prove that optimality can be achieved under mild assumptions. Empirical results are used to demonstrate its efficiency and usefulness.

__ Strengths__: They develop a new learning algorithm for point processes using the idea of noise-contrastive estimation.
Optimality and efficiency are guaranteed through theoretical analysis.

__ Weaknesses__: As there are already methods, such as the work of Guo et al., that speed up the learning of point processes, adding a comparison with those methods would make the experiments more convincing.

__ Correctness__: The logic is sound and supported by experiments.

__ Clarity__: The paper is easy to follow and clearly written.

__ Relation to Prior Work__: Related works are well addressed.

__ Reproducibility__: Yes

__ Additional Feedback__:

__ Summary and Contributions__: While the authors' response addressed some of my concerns, it is not enough to raise my rating.
The paper describes how noise-contrastive estimation can be used to train generative models of multivariate point processes in continuous time. The authors show that training is faster than with maximum likelihood estimation and, in most cases, of similar quality.
The authors prove that, under mild assumptions, their method fulfils theoretical guarantees and converges to the true parameters in the limit of infinite data.
They apply the method to multiple synthetic and real datasets and show how different parameters affect the outcome of the training and perform ablation studies.
The authors discuss related works and give their thoughts on the broader impact of their work.

__ Strengths__: The paper is clearly written, explains well what must change for NCE to work with multivariate point processes, and gives enough information to fully reproduce the results.
It is a good contribution and the authors provide ample theoretical grounding for their claims and evaluate their method on a range of datasets. They provide multiple ablation studies and discuss their choice of parameters in detail, especially in the supplementary material.

__ Weaknesses__: I was missing a discussion of and comparison with other ways to approximate the log-likelihood, e.g. variational approximations or Monte Carlo estimates. It would also be interesting to see what simple baseline log-likelihood models would have achieved on the data. The authors show that, for some of the data, using a Poisson process as q achieves very good results, but they do not show whether assuming p to be a simpler model would work as well. In general, the related works section could be a bit broader and touch on methods beyond NCE.
While the authors compare runs of NCE with different values for parameters like C and M, it would have been more informative to show a plot of the relationship of these parameters to convergence speed directly, instead of just having multiple runs in the same likelihood plot.
I think it is a good contribution but not a huge step from prior work on NCE for point processes.
Given that the main advantage over training with MLE is the computational complexity, it would also be nice to see results on data where MLE is not feasible.

__ Correctness__: Yes, I haven't found any incorrect statements.

__ Clarity__: The paper is overall very well written and easy to understand. The plots in the paper are very small, making some of the annotations impossible to read when printed and only readable when zoomed in closely on a computer.

__ Relation to Prior Work__: Yes, the authors clearly distinguish from prior work and explain limitations of similar approaches and how they needed to modify the method to work for multivariate point processes in continuous time.

__ Reproducibility__: Yes

__ Additional Feedback__: