Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
This paper proposes a class of random functions in which each member is a spline whose parameters are produced by a neural network from Gaussian noise. The first contribution is the ability to enforce non-negativity constraints on the splines by applying the alternating projection method to the output of the neural network. The resulting spline functions are non-negative and smooth, so they are good candidates for modeling the intensity functions of temporal point processes. The second contribution is thus to use smooth non-negative splines to model temporal point processes, which imposes less strict structural assumptions on the parametric form of the intensity function. Exploring new expressive processes is an important problem in the point process domain, and this paper advances knowledge in this area. Even though the overall scheme of the paper is excessively complicated, each section is well motivated, and I can tell that the authors tried to articulate it carefully. Experiments on both synthetic and real temporal event datasets show improvements over three sophisticated baselines.

Since the paper essentially provides a non-parametric approach for estimating the intensity function of point processes, it would be more convincing to compare with existing processes that are also non-parametric but much simpler. For example, in “Learning Networks of Heterogeneous Influence, NIPS 2012”, the temporally decaying triggering kernel can be formulated as a combination of basis kernels so that it can be learned in a non-parametric way. In “Decoupling Homophily and Reciprocity with Latent Space Network Models”, a sine kernel can be used to capture the periodicity of the temporal events. Moreover, since the approach uses a neural network to learn representations, it is also important to mention recent neural-network-based point processes, e.g., “The Neural Hawkes Process: A Neurally Self-Modulating Multivariate Point Process” and “Recurrent Marked Temporal Point Processes: Embedding Event History to Vector”. I also suggest conducting a goodness-of-fit evaluation with a QQ-plot to test how well the proposed temporal point process represents the observed temporal events. Finally, can the authors elaborate on the training and inference scalability of the proposed approach with respect to the number of spike trains, the number of spline knots, and the hidden dimension m? For example, large intervals are expected to require more knots to maintain a given level of accuracy.
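The QQ-plot check suggested above is commonly based on the time-rescaling theorem: if the fitted intensity is correct, the integrated intensity between consecutive events should be distributed as Exp(1). The following is a minimal sketch of that check, not the authors' evaluation code. All function names, the sinusoidal `true_intensity`, and the thinning simulator are assumptions introduced purely for illustration.

```python
import math
import random


def integrate(intensity, a, b, steps=50):
    """Trapezoid-rule approximation of the integral of `intensity` over [a, b]."""
    h = (b - a) / steps
    total = 0.5 * (intensity(a) + intensity(b))
    for i in range(1, steps):
        total += intensity(a + i * h)
    return total * h


def rescaled_intervals(event_times, intensity):
    """Integrated intensity between consecutive (sorted) events, starting at t = 0."""
    intervals, prev = [], 0.0
    for t in event_times:
        intervals.append(integrate(intensity, prev, t))
        prev = t
    return intervals


def qq_points(intervals):
    """Pair Exp(1) theoretical quantiles with the sorted rescaled intervals."""
    n = len(intervals)
    theoretical = [-math.log(1.0 - (i + 0.5) / n) for i in range(n)]
    return theoretical, sorted(intervals)


def simulate_by_thinning(intensity, lam_max, t_end, rng):
    """Simulate an inhomogeneous Poisson process; requires intensity(t) <= lam_max."""
    t, events = 0.0, []
    while True:
        t += rng.expovariate(lam_max)
        if t > t_end:
            return events
        if rng.random() * lam_max < intensity(t):
            events.append(t)


def true_intensity(t):
    # Hypothetical ground-truth intensity, used only for this illustration.
    return 5.0 + 4.0 * math.sin(t)


# Simulate events, then rescale them with the same intensity that generated them.
events = simulate_by_thinning(true_intensity, 9.0, 200.0, random.Random(0))
theoretical, empirical = qq_points(rescaled_intervals(events, true_intensity))
```

Plotting `empirical` against `theoretical` should give points close to the diagonal when the model is well specified. Replacing `true_intensity` in the rescaling step with a fitted intensity turns this into the goodness-of-fit test requested above.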
AFTER AUTHOR FEEDBACK
I think the authors have addressed my comments/concerns, and I am happy to keep my current score.
-------------------------------------------------------------------------------------------------------
BEFORE AUTHOR FEEDBACK
I think this is a very well written paper. The arguments presented are sound and are backed by theoretical investigations and empirical evidence. The authors also gave careful consideration to the various technical details that are essential for the proposed method to work in practice. Moreover, the proposed method is flexible and can be generalized to other constraints. In addition, the authors explained why other, possibly more naive, methods fail, and I found this useful for dispelling preconceptions I had about this paper. The proposed method is quite complicated, and this made me wonder whether any simplification is possible. Since B-spline basis functions are nonnegative, nonnegativity can be achieved simply by constraining the basis coefficients to be nonnegative (or between 0 and 1 for density estimation). Shen and Ghosal (2015), in their work on adaptive random series priors, showed that splines under these constraints have good approximation properties for nonnegative functions (see the Appendix about B-splines). Hence, instead of using the representation given by Lasserre, I am wondering whether it would be more efficient to use the nonnegative characterization mentioned above, working with nonnegative scalar basis coefficients rather than nonnegative-definite matrices.
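The simplification raised in this review can be sketched as follows: pass unconstrained values through a softplus to obtain positive B-spline coefficients, so the resulting spline is nonnegative by construction. This is only an illustration of the reviewer's suggestion, not the paper's method. The knot layout, the degree, and the `theta` values standing in for a network output are all assumptions.

```python
import math


def softplus(theta):
    """Numerically stable log(1 + exp(theta)); always strictly positive."""
    return max(theta, 0.0) + math.log1p(math.exp(-abs(theta)))


def clamped_knots(num_interior, degree):
    """Clamped knot vector on [0, 1] with equally spaced interior knots."""
    inner = [j / (num_interior + 1) for j in range(num_interior + 2)]
    return [0.0] * degree + inner + [1.0] * degree


def bspline_basis(i, k, knots, x):
    """Cox-de Boor recursion for the i-th B-spline basis function of degree k."""
    if k == 0:
        return 1.0 if knots[i] <= x < knots[i + 1] else 0.0
    left = right = 0.0
    if knots[i + k] > knots[i]:
        left = ((x - knots[i]) / (knots[i + k] - knots[i])
                * bspline_basis(i, k - 1, knots, x))
    if knots[i + k + 1] > knots[i + 1]:
        right = ((knots[i + k + 1] - x) / (knots[i + k + 1] - knots[i + 1])
                 * bspline_basis(i + 1, k - 1, knots, x))
    return left + right


def spline_value(coeffs, knots, degree, x):
    """Evaluate the spline as a weighted sum of basis functions."""
    return sum(c * bspline_basis(i, degree, knots, x) for i, c in enumerate(coeffs))


degree = 3
knots = clamped_knots(num_interior=4, degree=degree)
# Hypothetical unconstrained values, standing in for a neural network output.
theta = [-3.0, -1.0, 0.5, 2.0, -2.0, 0.0, 1.0, -0.5]
coeffs = [softplus(v) for v in theta]
# Evaluate on [0, 1); the half-open degree-0 convention excludes the right endpoint.
grid = [j / 100 for j in range(100)]
values = [spline_value(coeffs, knots, degree, x) for x in grid]
```

Because the basis functions are nonnegative and sum to one on the interval, every value lies between zero and the largest coefficient. Note that nonnegative coefficients are sufficient but not necessary for a nonnegative spline, so this parameterization covers a smaller function class than an exact characterization.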
This manuscript has strong merits on a theoretical level: it studies an important and wide-ranging class of models by providing an innovative and creative parameterization of smooth, random functions. It is not yet clear to me whether this method will be easy to use in practice, or whether adding appropriate modifications/approximations to Gaussian Process methods would be sufficient for practical purposes.

This paper only explores incorporating a nonnegativity constraint on the function class. However, this doesn't seem to be a serious challenge for methods like pp-GPFA, which pass the GP through an exponential function and use black box variational inference to fit the model. While the authors state that they outperform GPFA, this isn't too surprising, since their VAE allows for a nonlinear mapping from the latent space, while GPFA is restricted to a linear mapping. In other words, the improvement could be attributed to factors that are not the core message of the work. (Perhaps the authors could argue/show that GPs + BB variational inference wouldn't scale to fit a similar VAE?)

The way the authors control for degrees of freedom across models in the model comparison section is confusing to me. The bin size seems like it should be optimized on a per-model basis, with only the dimension of the latent representation kept constant across models. Adding more bins for the spike times does not make the latent representation less interpretable, and I wouldn't expect it to be very computationally costly. So why handicap these baseline models?

Finally, I can think of a couple of potential advantages of Gaussian Process methods over the DRS approach.
I don't view these as critical shortcomings of the paper; however, it might be useful for the authors to directly address or clarify these points:
1) By specifying the kernel/covariance functions of the GP, practitioners can control the power spectrum and add (for example) periodic structure into the model. Additionally, this property makes GPs very attractive for certain analytic derivations. Controlling the frequency content in DRS would seem more challenging.
2) DRS, like many deep networks, may be challenging to optimize and require substantial hyperparameter tuning for different problems, whereas GPs are not typically combined with deep nets and thus are easier to optimize.
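To make point 1) concrete, here is a minimal sketch of how a kernel choice injects periodic structure into a GP sample, which is then passed through an exponential to give a positive intensity, in the spirit of the exponential link mentioned for pp-GPFA above. It is not an implementation of pp-GPFA or of DRS. The exp-sine-squared kernel, the grid, the jitter value, and all function names are assumptions chosen for illustration.

```python
import math
import random


def periodic_kernel(s, t, period=1.0, length_scale=1.0, variance=1.0):
    """Exp-sine-squared kernel: covariance depends on |s - t| modulo `period`."""
    phase = math.sin(math.pi * abs(s - t) / period)
    return variance * math.exp(-2.0 * phase ** 2 / length_scale ** 2)


def cholesky(a):
    """Lower-triangular L with L L^T = a, for a symmetric positive-definite a."""
    n = len(a)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(lower[i][m] * lower[j][m] for m in range(j))
            lower[i][j] = math.sqrt(s) if i == j else s / lower[j][j]
    return lower


def sample_gp(grid, kernel, rng, jitter=1e-6):
    """Draw one zero-mean GP sample on `grid`; jitter keeps the covariance invertible."""
    n = len(grid)
    cov = [[kernel(grid[i], grid[j]) + (jitter if i == j else 0.0)
            for j in range(n)] for i in range(n)]
    lower = cholesky(cov)
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return [sum(lower[i][m] * z[m] for m in range(i + 1)) for i in range(n)]


# Six periods of length 1.0, sampled every 0.1 time units.
grid = [0.1 * j for j in range(60)]
latent = sample_gp(grid, periodic_kernel, random.Random(0))
# Exponential link: the intensity is positive for any real-valued latent path.
intensity = [math.exp(f) for f in latent]
```

The sampled path repeats with the chosen period up to the jitter noise, so the frequency content is set directly by `period` and `length_scale` rather than learned implicitly.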