NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 7007
Title: Uncertainty on Asynchronous Time Event Prediction

Reviewer 1

This paper makes a compelling case for explicitly modelling a time-varying distribution over the simplex for asynchronous event prediction. This in itself makes the paper a valuable contribution. The use of a Gaussian process to model the time dependency, with pseudo points produced by an RNN, is original but not particularly well motivated. The authors even mention the alternative of modelling the time evolution of the distribution directly with an RNN, which seems like the natural approach to avoid relying on pseudo points. However, the merits and disadvantages of the two approaches are never compared. The paper is mostly well written and motivates the methodology well, but could do with more discussion of various topics, such as the impact of UCE vs. CE, how training looks in practice, or the choice of how many pseudo points to use.

Reviewer 2

This paper introduces two methods for predicting asynchronous multi-class events. The methods pay special attention to attaching confidence values to their predictions. The models describe how the distribution over outcomes changes as a function of time into the future, assuming that no other event has happened in the meantime. The first method (WGP-LN) uses an RNN to output a set of M control (pseudo) points to model a Gaussian process. At first the authors describe learning a Gaussian process for each output class, but they find that this leads to overconfidence at the control points. They then train the RNN to also output weights for the Gaussian process. Since there is only a small number of control points (<10), the Gaussian process is fast to compute and is also differentiable, thus enabling fast training with SGD. The second method uses the Dirichlet distribution and is similar in practice: the RNN still outputs a set of Gaussian basis functions and weights. The calculation of the loss for this method involves proposed closed-form expressions, which are then approximated with a second-order series expansion. The authors also define a loss function (the uncertainty cross-entropy loss) which is better suited to learning uncertainty on the categorical distribution. Also introduced is a point-process framework which can be used to predict the most likely time of the next event. The related work section seems a bit thin, but I am not an area expert. The authors provide many experimental results which show their methods consistently outperforming other methods. They generate toy examples and show an ability to predict the evolution of next events, along with the confidence for those predictions. The appendix appears to cover the details of the methods and results. The supplementary materials also contain an excellent-looking IPython notebook with TensorFlow implementations of the methods and toy examples.
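To make the second method's description concrete, the following is a minimal sketch of how Gaussian basis functions and weights could define a time-dependent Dirichlet distribution over event classes. It is an illustration under stated assumptions, not the authors' implementation: the function names, the constant prior of 1.0, the non-negative weights, and all numeric values are hypothetical stand-ins for what the RNN would output, and the paper's exact parameterization may differ.

```python
import math


def concentration(tau, weights, centers, widths, prior=1.0):
    """Dirichlet concentration of one class at elapsed time tau.

    Assumed form: a prior plus a weighted sum of Gaussian bumps.
    Weights are assumed non-negative so the result stays positive.
    """
    return prior + sum(
        w * math.exp(-((tau - c) ** 2) / (2.0 * s ** 2))
        for w, c, s in zip(weights, centers, widths)
    )


def class_distribution(tau, params, prior=1.0):
    """Return the Dirichlet mean and the total concentration at time tau."""
    alphas = [concentration(tau, w, c, s, prior=prior) for w, c, s in params]
    total = sum(alphas)
    mean = [a / total for a in alphas]
    return mean, total


# Hypothetical per-class (weights, centers, widths), standing in for RNN outputs.
params = [
    ([5.0], [1.0], [0.5]),  # class 0: evidence concentrated around tau = 1
    ([5.0], [3.0], [0.5]),  # class 1: evidence concentrated around tau = 3
]

for tau in (1.0, 3.0, 10.0):
    mean, total = class_distribution(tau, params)
    print(tau, [round(m, 3) for m in mean], round(total, 3))
```

In this sketch the total concentration falls back to the prior far from every basis function, so the predicted class distribution becomes both flat and low-confidence at times where there is no evidence.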
I think the problem described (asynchronous, multi-class event prediction WITH attention being paid to the confidence of the prediction) is of great importance to the community, and these methods appear to be solid contributions. The code may be useful to many. Provided other reviewers pass the math, I think this is a good paper.

Reviewer 3

The authors target a very particular problem: quantifying the uncertainty in predicting the type of event which will happen asynchronously in the future, where the probability of the event depends on time. This is an important problem and differs from other settings of uncertainty prediction which have been explored elsewhere. The paper is very well written (besides some minor reorganization issues) and the illustrations are of high quality. The techniques described are sound and novel, and bringing them together is an important contribution. The experiments are described in adequate detail, and the provided code is relatively easy to parse and re-run to reproduce a subset of the results. However, there are two points which could improve the quality and clarity of the submission. The first is that the related work [6, 13] is mildly mischaracterized, and the cost of introducing sampling is not fully elucidated. Most notably, the Neural Hawkes Process can successfully model multi-modal distributions of events, as FD-Dir-PP does, and hence its characterization (e.g. in line 230) could be improved. Similarly, while it is true that RMTPP models the type and time of the next event independently of each other (which precludes multi-modal event distributions), it does so in order to allow rapid training and efficient use of GPUs. This is a subtle difference, which leads to the second point: the true cost of the sampling step. TensorFlow, and GPU-based training in general, works best when the entire training iteration happens on the GPU (i.e. no use of feed_dict with computed values). However, the sampling step, to the best of my knowledge, cannot be done on the GPU and needs to happen on the CPU. The Neural Hawkes Process [13] suffers from this too, because it needs to perform Monte Carlo sampling in order to numerically evaluate integrals. This otherwise minor detail introduces a significant bottleneck in training time.
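For readers unfamiliar with the sampling step mentioned here, the following is a minimal sketch of Monte Carlo estimation of an intensity integral, the kind of quantity that appears in point-process likelihoods. It is not taken from any of the cited models: the intensity function, the helper name `mc_integral`, and the sample counts are hypothetical, chosen so that the estimate can be compared against a known closed form.

```python
import math
import random


def intensity(t):
    """Hypothetical intensity with a known integral, used only for illustration."""
    return 1.0 + math.exp(-t)


def mc_integral(f, upper, num_samples, rng):
    """Estimate the integral of f over [0, upper] from uniform samples."""
    total = sum(f(rng.uniform(0.0, upper)) for _ in range(num_samples))
    return upper * total / num_samples


# Closed form of the integral of 1 + exp(-t) over [0, 2].
exact = 2.0 + 1.0 - math.exp(-2.0)

rng = random.Random(0)
for n in (10, 1000, 100000):
    estimate = mc_integral(intensity, 2.0, n, rng)
    print(n, round(estimate, 4), round(abs(estimate - exact), 4))
```

The error of such an estimator shrinks roughly with the square root of the number of samples, so each extra digit of accuracy costs about a hundred times more intensity evaluations per training step.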
I believe an honest discussion of the pros and cons of the adopted approach would further strengthen the paper and help put the contributions in context. Along the same lines, it is unclear how increasing or decreasing the number of samples 'M' trades off training time against quality; the Pareto-optimal front would also be interesting to speculate on or demonstrate. Given the overall contributions of the paper, I do not have any hesitation in recommending it for publication.

Minor points:
- Line 131: "a a".
- Line 94: y_j^{(c)} is not defined without looking at the appendix.
- Line 137: The claim that the model is fully differentiable is not justified until the loss function is provided.
- Line 386: In the first denominator, the sum should run up to C.
- Line 444: Missing figure reference.
- Line 459: Comment in blue.

----------------------
Update after the author response: I have read the response and it has cleared up some misconceptions I had. A more nuanced treatment of the related work will also be appreciated.