NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID:5251
Title:Meta Learning with Relational Information for Short Sequences

Reviewer 1

The authors propose a mixture of Hawkes processes that incorporates relational information. The event times of each subject are modelled as a mixture of K common Hawkes processes, so the temporal dependency is captured entirely by the Hawkes processes. The Hawkes process parameters of subject i deviate slightly from those of the K common Hawkes processes, and the deviation is determined by the gradient of the likelihood (Eq. (1)), which can be learned with a method called MAML. The relational information (whether subjects i and j are linked) is modelled through the weights that subjects i and j place on the K common Hawkes processes. Variational inference (the E-step) is used to approximate the posterior of Z and \pi given the parameters of the K common Hawkes processes and the data. The objective is then maximised with respect to the parameters of the K common Hawkes processes. The method is compared with other similar approaches and achieves better performance.

The ideas of incorporating relational information and of subject-specific parameters (the MAML part) are very interesting. The derivations seem reasonable and correct. I would like to accept this paper if the authors can address some questions regarding the experiments.

1. Is it standard to compare log-likelihoods in this field? As I understand it, the authors are tackling a prediction problem. Is it possible to predict the expected occurrence time of the next event and compare the RMSE? Also, since this method optimises the log-likelihood while the others do not, the comparison seems unfair.
2. Why are all the compared methods based on Hawkes processes? Is it possible to compare with an RNN?
3. Lines 198-199 state that only the last time stamp is held out. Are all the other time stamps used in training? My understanding is that all time stamps of the test data should be held out.
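For readers who want a concrete picture of the subject-specific deviation summarised at the start of this review, the following is a minimal sketch and not the authors' implementation. It assumes a univariate Hawkes process with an exponential kernel, a single gradient step on the subject's own log-likelihood, and a finite-difference gradient; the names `hawkes_loglik`, `numerical_grad`, `adapt`, the step size `eta`, and the toy event times are all hypothetical.

```python
import numpy as np

def hawkes_loglik(theta, times, T):
    """Log-likelihood of a univariate Hawkes process on [0, T].

    Assumed intensity (exponential kernel):
        lambda(t) = mu + alpha * sum_{t_j < t} exp(-beta * (t - t_j))
    `times` must be sorted and lie in [0, T].
    """
    mu, alpha, beta = theta
    times = np.asarray(times, dtype=float)
    log_sum = 0.0
    decay_sum = 0.0  # sum over earlier events of exp(-beta * (t_i - t_j))
    prev = None
    for t in times:
        if prev is not None:
            # Recursive update of the decayed influence of past events.
            decay_sum = np.exp(-beta * (t - prev)) * (1.0 + decay_sum)
        log_sum += np.log(mu + alpha * decay_sum)
        prev = t
    # Integral of the intensity over [0, T] (the compensator).
    compensator = mu * T + (alpha / beta) * np.sum(1.0 - np.exp(-beta * (T - times)))
    return log_sum - compensator

def numerical_grad(f, theta, eps=1e-6):
    """Central finite-difference gradient (illustrative stand-in for autodiff)."""
    theta = np.asarray(theta, dtype=float)
    grad = np.zeros_like(theta)
    for k in range(theta.size):
        step = np.zeros_like(theta)
        step[k] = eps
        grad[k] = (f(theta + step) - f(theta - step)) / (2 * eps)
    return grad

def adapt(theta_common, times, T, eta=0.01):
    """One gradient-ascent step on the sequence's own log-likelihood."""
    grad = numerical_grad(lambda th: hawkes_loglik(th, times, T), theta_common)
    return np.asarray(theta_common, dtype=float) + eta * grad

# Hypothetical common parameters (mu, alpha, beta) and one short sequence.
theta_k = np.array([0.5, 0.3, 1.0])
times_i = [0.4, 1.1, 1.3, 2.7]
theta_k_i = adapt(theta_k, times_i, T=3.0)
```

Here `theta_k_i` plays the role of the subject-specific parameters obtained from the common parameters `theta_k`. The paper's actual parameterisation, kernel, and step rule may differ.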

Reviewer 2

In this paper, the authors propose a hierarchical Bayesian mixture of Hawkes processes with a parameter adaptation mechanism based on a meta-learning technique for modeling multiple short event sequences with graph-like side information. In the proposed model, each sequence is modeled by a mixture of Hawkes processes whose mixture proportions are related to the adjacency of the sequence to the other sequences. Moreover, the parameters of the component Hawkes processes are slightly varied among sequences using the mechanism of the model-agnostic meta-learning framework. The authors provide experimental results on synthetic and real-world datasets, which show the superiority of the proposed method.

Overall, the paper is very well written. The technical details are explained in an easy-to-follow way, and the proposed method is clearly positioned in the context of event sequence modeling using Hawkes processes.

I am not completely sure about the motivation for using MAML to adapt the parameters to each sequence. What is the biggest advantage of using MAML instead of simply letting $\theta_k^{(i)}$ fluctuate around $\theta_k$ under some regularization terms? In other words, isn't it possible to consider multi-task learning (using graph information) for mixtures of Hawkes processes?

The figures in Table 1 would be easier to interpret if the colors of the nodes in the right three columns were roughly aligned geometrically with those in the leftmost column.

In Table 2, the performance of HARMLESS with MAML, FOMAML or Reptile differs greatly in some cases. Is there any guideline for choosing which meta-learning method should be used in general?

----- [Update after authors' rebuttal] Thank you for the rebuttal. I read it. The points I mentioned are minor (they are just about clarification), so I have kept my score at 7. Good luck!
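As background to the question about choosing among meta-learning variants, the sketch below contrasts the outer updates of FOMAML and Reptile on a toy problem. It is only an illustration under strong assumptions: each "task" is a quadratic loss 0.5 * ||theta - c||^2 rather than a Hawkes likelihood, the same data is used for the inner and outer steps, and all function names, learning rates, and task centers are hypothetical.

```python
import numpy as np

def inner_adapt(theta, grad_fn, lr=0.1, steps=3):
    """Run a few plain gradient-descent steps on one task, starting from theta."""
    phi = theta.copy()
    for _ in range(steps):
        phi = phi - lr * grad_fn(phi)
    return phi

def fomaml_update(theta, task_grads, inner_lr=0.1, outer_lr=0.5, steps=3):
    """First-order MAML: use the task gradient at the adapted parameters,
    ignoring the second-order term through the inner loop."""
    outer_grad = np.zeros_like(theta)
    for grad_fn in task_grads:
        phi = inner_adapt(theta, grad_fn, inner_lr, steps)
        outer_grad += grad_fn(phi)
    return theta - outer_lr * outer_grad / len(task_grads)

def reptile_update(theta, task_grads, inner_lr=0.1, outer_lr=0.5, steps=3):
    """Reptile: move theta toward the average of the adapted parameters."""
    direction = np.zeros_like(theta)
    for grad_fn in task_grads:
        phi = inner_adapt(theta, grad_fn, inner_lr, steps)
        direction += phi - theta
    return theta + outer_lr * direction / len(task_grads)

# Toy tasks: the gradient of 0.5 * ||theta - c||^2 is (theta - c).
centers = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
task_grads = [lambda th, c=c: th - c for c in centers]

theta0 = np.zeros(2)
theta_fomaml = fomaml_update(theta0, task_grads)
theta_reptile = reptile_update(theta0, task_grads)
```

On this toy problem both updates move the shared parameters toward the mean of the task optima, but by different amounts for the same learning rates, which gives one intuition for why the variants can behave differently in practice. It says nothing about which variant is preferable for the datasets in Table 2.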

Reviewer 3

This paper presents a meta-learning method for learning heterogeneous point process models for short sequence data with a relational network. A hierarchical Bayesian mixture Hawkes process model is proposed to incorporate relational information. The method has been tested on both synthetic and real data. The Bayesian model captures the underlying mixed-community patterns of the relational network. Meanwhile, the model enables knowledge sharing among sequences and facilitates adaptive learning of individual sequences using the model-agnostic meta-learning technique. A stochastic variational meta-EM algorithm is also derived.

My major concern is the performance of the proposed method. In the experiments, the performance of HARMLESS (MAML) is lower than that of two baselines on the StackOverflow data. Also, the performance of the other two HARMLESS variants (FOMAML and Reptile) is much lower than that of HARMLESS (MAML) on the LinkedIn data. The performance of the proposed method could be further improved. Moreover, it would be useful to discuss the computational complexity of the proposed method.