NIPS 2018
Sunday, December 2 through Saturday, December 8, 2018, at the Palais des Congrès de Montréal
Paper ID 3472: Bayesian Alignments of Warped Multi-Output Gaussian Processes

### Reviewer 1

This submission presents a "three-layer" Gaussian process for multiple time-series analysis: a layer for transforming the input, a layer for a convolutional GP, and a layer for warping the outputs. This is a different "twist" or "flavor" of the existing deep-GP model. Approximate inference is performed with the scalable version of variational inference based on inducing points. The authors state that one main contribution is the "closed-form solution for the $\Phi$-statistics for the convolution kernel". Experiments on a real data set from two wind turbines demonstrate its effectiveness over three existing models in terms of test log-likelihoods.

[Quality] This is quality work, with a clear model, approximation, and experimental results. In addition, Figure 3 shows an illustrative comparison with existing models; results against these models are also given in Table 1. One shortcoming is that the authors have not considered how their approach is better (perhaps in terms of inference) than a more straightforward model in which the alignment is placed directly on the input, without convolution.

[Clarity] L41: I would describe the model as "nested" rather than "hierarchical", so as not to confuse it with Bayesian hierarchical models. Section 2: I think this entire section should be rewritten purely in terms of time series, that is, one-dimensional GPs, with the bold face on $x$ and $z$ removed. This is because L69-L70 describe only a single-output function $a_{d}$, which means $z$ must be one-dimensional, and hence $x$ is one-dimensional as well. If multi-dimensional inputs are desired, the paper perhaps has to use a multi-output function for $a_{d}$. Also, for Equation 1, it should be stated that the functions are applied point-wise. Since the authors cite both [14] and [19] for the "warped" part, I suggest stating clearly that the model follows the spirit of [14] rather than [19].

[Originality] I would dispute the claim on L145 that one "main contribution ... is the derivation of a closed-form solution for the $\Phi$-statistics". This is because the solution holds only for the RBF kernel, and obtaining the expressions is simply a mathematically tedious step given the previous work of [8] and [25, Section B.1]. In addition, once the model is stated, the rest trivially falls into place based on existing work.

[Significance] I commend the authors for proposing this model, which I think will be very useful for time-series analysis and gives yet another manner in which GPs can be "nested"/"deepened".

[25] M. K. Titsias and M. Lázaro-Gredilla. Variational Inference for Mahalanobis Distance Metrics in Gaussian Process Regression. NIPS 26, 2013.

[Comments on reply] The reply would be more convincing if it briefly addressed "how alignments can be generalized to higher dimensions".
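
For readers unfamiliar with the nesting the review describes, the three-layer composition can be illustrated by drawing GP prior samples and composing them point-wise. This is a minimal sketch only, not the paper's implementation: the RBF kernels, the fixed `tanh` output warp, and all function names are my own assumptions.

```python
import numpy as np

def rbf(x1, x2, ell=1.0, var=1.0):
    """Squared-exponential (RBF) kernel on one-dimensional inputs."""
    d = x1[:, None] - x2[None, :]
    return var * np.exp(-0.5 * (d / ell) ** 2)

def sample_gp(x, ell=1.0, var=1.0, rng=None, jitter=1e-6):
    """Draw one prior sample from a GP with an RBF kernel at inputs x."""
    rng = np.random.default_rng(0) if rng is None else rng
    K = rbf(x, x, ell, var) + jitter * np.eye(len(x))  # jitter for stability
    return np.linalg.cholesky(K) @ rng.standard_normal(len(x))

rng = np.random.default_rng(42)
x = np.linspace(0.0, 1.0, 200)                 # time (one-dimensional input)
a = x + 0.1 * sample_gp(x, ell=0.3, rng=rng)   # layer 1: alignment (time warp)
f = sample_gp(a, ell=0.2, rng=rng)             # layer 2: latent GP at aligned times
y = np.tanh(f)                                 # layer 3: point-wise output warping
```

The sketch makes the reviewer's point about Equation 1 concrete: each layer acts point-wise on the previous layer's output, so the whole composition stays one-dimensional in time.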

### Reviewer 2

This work presents a new Gaussian process model for multiple time series that a) have different temporal alignments, or time warpings (which are learned), and b) are related (in the multi-task sense), where the level of relatedness between tasks is also learned. It is applied to toy data and a very small real-world example of power generated by wind turbines.

The model is well described, and the diagrams used to explain it are helpful for understanding it. Relevant connections with the existing literature (multi-output GPs, deep GPs, warped GPs, etc.) are established. A connection with previous literature on GPs for time-series alignment seems to be missing, for instance "Time series alignment with Gaussian processes", N. Suematsu and A. Hayashi, Pattern Recognition (ICPR), 2012.

The model and the corresponding inference scheme, despite its laborious derivation, are by themselves not very novel. They are a relatively straightforward combination of ideas from the above-mentioned literature and require the same tricks of introducing pseudo-inputs and variational inference to obtain a lower bound on the evidence. The main contribution could be the concrete derivation of the $\Phi$-statistics for this particular model. I think the derivations are sound. However, I am more concerned about the practical applicability of this model:

- The model is only applied to toy(ish) data. The computational cost is not discussed and might be very large for bigger datasets and for more than two signals.
- The relation between tasks is governed by the set of hyperparameters $\ell_{dd'}$. In the simple examples given, with only two tasks, this might work well. For cases with many signals, learning them might be subject to many local minima.
- The learning of this model is very non-convex, since there are potentially multiple alignments and task-relatedness levels that achieve good performance (there are even multiple degenerate solutions for more than two outputs), so the hyperparameter optimization and inference can become very tricky for any non-trivially sized dataset.

Minor:

- The clarity of the exposition could be improved. For instance, my understanding is that $X$ is actually always one-dimensional and refers to time, but this is not clear from the paper's exposition. I incorrectly thought it would correspond to the locations of the turbines.
- Continuing the previous point: am I correct that the locations of the turbines are not used at all? They would seem to provide valuable information about task relatedness.
- There could be a comment mentioning that learning $\ell_{dd'}$ is what captures the "task relatedness".

After reading the authors' response: The response does not address the questions I raised above. What is the computational cost? The authors manage to "address" this issue in no fewer than 11 lines of text without giving either the computational complexity of the algorithm or the runtimes. They also do not properly define $X$ in the rebuttal.
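
As a rough back-of-envelope for the cost question raised above: standard sparse variational GP inference with $M$ inducing points scales as $O(NM^2)$ per output and layer, plus $O(M^3)$ for factoring the inducing-point covariance. This is a generic estimate for sparse variational GPs, not a figure from the paper, and the function below with its defaults (three layers, two outputs) is hypothetical.

```python
def svgp_cost(n, m, layers=3, outputs=2):
    # Dominant flop count of sparse variational GP inference:
    # O(N * M^2) per output per layer, plus O(M^3) to factor the
    # M x M inducing-point covariance matrix.
    return layers * outputs * (n * m ** 2 + m ** 3)

# 10k data points, 100 inducing points, across all layers and outputs.
print(f"{svgp_cost(10_000, 100):,}")  # prints "606,000,000"
```

Even this crude count suggests the cost grows linearly in the number of outputs, which is why runtimes for more than two signals would have been informative.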