NIPS 2017
Mon Dec 4th through Sat the 9th, 2017 at Long Beach Convention Center
Paper ID: 1035 Predicting User Activity Level In Point Processes With Mass Transport Equation

### Reviewer 2

This paper provides a framework for the analysis of point processes, in particular for predicting a function f(N(t)), where N(t) counts the number of times an event has occurred up to time t. The paper derives an ODE (4) that governs the evolution of the probability that N takes a particular value at time t (so the derivative in the ODE is with respect to t). After approximating the solution to this ODE, the expectation of f can be taken.

This paper is distant from my expertise, so I advise the committee to pay closer attention to other reviews of this paper!

From my outsider's perspective, this paper could benefit from an expositional revision at the least. In particular, it's easy to "lose the forest for the trees." The paper gives a nice high-level introduction to the problem and then dives right into math. It'd be great to give a "middle" level of detail (just a paragraph or two) to fill the gap, e.g. make all the variables lambda, phi, f, N, H_t, ... concrete for a single application, including a clear statement of the input and output data (this wasn't clear to me!). Roughly just a few sentences to the effect "As an example, suppose we wish to predict X from Y. We are given Z (notated using variables W in our paper) and wish to predict A (notated B in our paper)." This would help me contextualize the math quite a bit: I could follow the equations formally but was struggling on the applications side to understand what was going on.

Notes:

* Given the recent popularity of "optimal transport" in machine learning, it might be worth noting that here you refer to "mass transport" as the ODE/PDE for moving around mass rather than anything having to do with e.g. Wasserstein distances.
* Theorem 2 is quite natural, and before the proof sketch I'd recommend explaining what's going on in a few words. All this equation is saying is that phi(x,t) changes for two potential reasons: an event occurs in the regime that hasn't reached phi(x,t), bumping up phi(x-1,t) (this is the second term), or an extra event occurs and bumps things up to phi(x+1,t) (the first term).
* l.153: why should the upper bound for x be known a priori?
* l.190(ii): is it clear at this point what you mean by "compute the intensity function"? This seems vague.
* Paragraph starting l.207: what is u?
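To make the reading of Theorem 2 above concrete, here is a minimal sketch of the inflow/outflow picture. It is not the paper's equation (4): it assumes a constant intensity `rate` (a homogeneous Poisson process), for which the mass transport equation reduces to the textbook form d phi(x,t)/dt = rate * (phi(x-1,t) - phi(x,t)). The function names, the forward-Euler scheme, and the truncation level `max_count` are all illustrative assumptions.

```python
import math


def evolve_pmf(rate, t_end, max_count, steps=20000):
    """Forward-Euler integration of d phi(x,t)/dt = rate * (phi(x-1,t) - phi(x,t)).

    Assumes a constant intensity `rate`; the support is truncated to 0..max_count.
    """
    phi = [0.0] * (max_count + 1)
    phi[0] = 1.0  # at t = 0 no events have occurred
    dt = t_end / steps
    for _ in range(steps):
        new_phi = phi[:]
        for x in range(max_count + 1):
            inflow = rate * phi[x - 1] if x > 0 else 0.0  # mass arriving from x-1
            outflow = rate * phi[x]                       # mass leaving x for x+1
            new_phi[x] = phi[x] + dt * (inflow - outflow)
        phi = new_phi
    # Mass leaving the last state x = max_count is discarded by the truncation.
    return phi


def poisson_pmf(mean, x):
    """Closed-form pmf of a Poisson variable, used only as a reference."""
    return math.exp(-mean) * mean ** x / math.factorial(x)


if __name__ == "__main__":
    phi = evolve_pmf(rate=2.0, t_end=3.0, max_count=30)
    for x in (0, 3, 6, 9):
        print(x, round(phi[x], 4), round(poisson_pmf(6.0, x), 4))
```

Under the constant-intensity assumption the integrated mass function should track the Poisson pmf with mean `rate * t_end`, which gives a quick sanity check on the two-term interpretation.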

### Reviewer 3

The authors propose Hybrid, a framework to estimate the probability mass function for point processes. They reduce the problem to estimating the mass function conditioned on a given history, and solve the mass transport equation on each interval to obtain the mass for the future, until the desired time t. This method is shown to have better sampling efficiency compared to MC.

I would like the authors to be more clear on the following points:

(1) Regarding applying this framework, is the intensity function \lambda(t) known (pre-defined) or learned from the data? I assumed that it is pre-defined, since the first step (line 190) of generating history samples depends on the intensity function.

(2) When comparing the running time in line 52 and line 269, the criterion is to achieve the same MAPE on \mu. Does the same conclusion still hold when the target is the MAPE of the probability mass function (i.e., P(N(t))) itself?
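The distinction behind point (2) can be illustrated with a small sketch. The definitions below are assumptions made for illustration, not the paper's: `mape_scalar` is the usual absolute percentage error for a single quantity such as \mu, and `mape_pmf` averages the same error over the support points where the true mass is positive. The toy distributions are invented.

```python
def mape_scalar(estimate, truth):
    """Absolute percentage error of a single estimated quantity."""
    return abs(estimate - truth) / abs(truth)


def mape_pmf(est_pmf, true_pmf, eps=1e-12):
    """Mean absolute percentage error over support points with true mass > eps.

    This pointwise definition is a hypothetical choice for illustration.
    """
    terms = [abs(e - p) / p for e, p in zip(est_pmf, true_pmf) if p > eps]
    return sum(terms) / len(terms)


def mean_of(pmf):
    """Expectation of f(x) = x under a pmf on 0, 1, 2, ..."""
    return sum(x * p for x, p in enumerate(pmf))


if __name__ == "__main__":
    true_pmf = [0.25, 0.5, 0.25]  # toy ground truth on x = 0, 1, 2
    est_pmf = [0.5, 0.0, 0.5]     # toy estimate with the same mean
    print(mape_scalar(mean_of(est_pmf), mean_of(true_pmf)))
    print(mape_pmf(est_pmf, true_pmf))
```

In this toy case the estimate matches the mean exactly yet is far off pointwise, which is why agreement in MAPE on \mu need not carry over to the mass function.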

### Reviewer 4

In summary, this work is concerned with variance reduction via Rao-Blackwellization for general point processes. The main insight is to condition on the filtration generated by the process and solve the differential equation describing the time evolution of the corresponding conditional distributions. The latter is facilitated by working on discrete state spaces, made finite with suitable truncation, and by using numerical integration to perform time discretization.

Although I find the contribution novel and am convinced of its utility in practical applications, I would like the authors to be more transparent about the impact of the above-mentioned spatial truncation and temporal discretization. In particular, the claim that their proposed methodology returns unbiased estimators is inaccurate when these approximations are taken into account. That said, I believe the debiasing methods described by McLeish (2010) and Rhee and Glynn (2015) may be employed here.

Typo on page 5, line 190: the form of the estimator for the expectation of the test function f is missing the averaging over replicates.
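The truncation concern can be seen in a minimal sketch. It assumes a count that is Poisson with a known mean, so that the exact answer is available. This is an illustrative stand-in, not the paper's setting, and all function names are hypothetical. The second helper shows what "averaging over replicates" would look like for per-history mass functions.

```python
import math


def poisson_pmf(mean, x):
    """Closed-form Poisson pmf, used here as a stand-in for the true mass function."""
    return math.exp(-mean) * mean ** x / math.factorial(x)


def truncated_mean(mean, max_count):
    """E[N] computed on the truncated support 0..max_count (tail mass dropped)."""
    return sum(x * poisson_pmf(mean, x) for x in range(max_count + 1))


def replicate_average(pmfs, f):
    """Average of sum_x f(x) * pmf_i(x) over replicate mass functions pmf_i."""
    per_replicate = [sum(f(x) * p for x, p in enumerate(pmf)) for pmf in pmfs]
    return sum(per_replicate) / len(per_replicate)


if __name__ == "__main__":
    # A tight truncation underestimates the mean; a generous one recovers it.
    print(truncated_mean(6.0, 8), truncated_mean(6.0, 60))
    # Two toy conditional mass functions averaged with f(x) = x.
    print(replicate_average([[0.5, 0.5], [0.0, 1.0]], lambda x: x))
```

Because the dropped tail terms are nonnegative for f(x) = x, the truncated estimate is biased downward for any finite `max_count`, however small the bias becomes as the bound grows.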