NeurIPS 2019
Sun Dec 8 – Sat Dec 14, 2019, Vancouver Convention Center
Reviewer 1
Summary: This paper designs a model for forecasting high-dimensional time series. The model uses an LSTM network to capture the transition of latent states and a Gaussian copula process to map the latent states to the observation domain: a Gaussian process models a low-rank covariance matrix, which makes parameter inference computationally cheaper, while a Gaussian copula transforms the non-Gaussian observations into variables with a standard Gaussian distribution, which improves the predictive power of the model. The contributions can be summarized as follows:
- The paper addresses a maximum-likelihood problem with a high-dimensional observation domain. To allow training on mini-batches (a few time series per mini-batch), the authors propose a Gaussian process model with a low-rank covariance matrix in which each observation has a non-time-varying component learned by the Gaussian process and a time-varying component learned by the LSTM. This design makes mini-batch learning possible.
- To handle non-Gaussian and potentially heavy-tailed data, the authors use a Gaussian copula to transform the observations into variables with a standard Gaussian distribution, which improves the predictive power of the model.
In the experiments section, the authors demonstrate their method on synthetic and real data and show that it outperforms competing autoregressive algorithms in many cases. The appendix contains the experimental details and hyperparameter settings for both the proposed model and the competing algorithms.

Quality: The motivation, claims, and supporting material in the main paper are explained well, and the paper does not contain significant theoretical or complicated design details. The quality of the experimental results is good, and all hyperparameter settings and experimental details are explained well.

Clarity: The paper's objectives and explanations are quite clear and the flow of material is smooth. A few small issues need to be fixed:
- line 132: R^d and R^{d \times d} → R^N and R^{N \times N}
- supplementary material, line 6: remove the "." from the beginning of the sentence

Originality: As mentioned in the summary, the main contributions are the low-rank Gaussian process parameterization, which makes mini-batch training possible, and the Gaussian-copula transformation of non-Gaussian, potentially heavy-tailed observations into standard Gaussian variables. The paper does not seem to have enough original contribution. The authors mostly adopt existing techniques and algorithms (see references below) and combine them without much interpretation or theoretical results. The method also does not propose any clever regularization or parameter-setting technique to avoid over-fitting and/or under-fitting.
For the Gaussian copula and variable transformation:
- Aussenegg, Wolfgang, and Christian Cech. "A new copula approach for high-dimensional real world portfolios." University of Applied Sciences bfi Vienna, Austria, Working Paper Series 68 (2012): 1-26.
- Liu, Han, et al. "High-dimensional semiparametric Gaussian copula graphical models." The Annals of Statistics 40.4 (2012): 2293-2326.
For low-rank parameter estimation:
- Liu, Haitao, et al. "When Gaussian process meets big data: A review of scalable GPs." arXiv preprint arXiv:1807.01065 (2018), and the works referenced by the authors in line 52.

Significance: The experiments section shows extensive experiments and relative success compared to competing algorithms, but I have some concerns about the synthetic data. The synthetic data are simple periodic series with the same period along all dimensions, and I would have expected the predictions to follow the synthetic data much more closely. Also, since the main goal of the paper is superior forecasting, it would be fair to compare against the paper below, which addresses the same problem with the same goal (I understand this may be difficult, as the authors of that paper have not shared their code yet):
- Sen, Rajat, Hsiang-Fu Yu, and Inderjit Dhillon. "Think Globally, Act Locally: A Deep Neural Network Approach to High-Dimensional Time Series Forecasting." arXiv preprint arXiv:1905.03806 (2019).
Another state-space model that could be considered is:
- Johnson, Matthew, et al. "Composing graphical models with neural networks for structured representations and fast inference." Advances in Neural Information Processing Systems, 2016.
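As a concrete illustration of the copula step summarized above (my own minimal sketch, not the authors' implementation; it assumes a per-dimension empirical CDF and uses illustrative names):

```python
import numpy as np
from scipy.stats import norm

def to_standard_gaussian(x, eps=1e-3):
    """Map one series to approximately standard Gaussian via its empirical CDF.

    x: 1-D array of observations for a single time series (one dimension).
    Returns z with z_t = Phi^{-1}(F_hat(x_t)), where F_hat is the empirical CDF.
    Illustrative sketch only, not the paper's code.
    """
    n = len(x)
    # Empirical CDF evaluated at each observation (rank / (n+1)), clipped away
    # from 0 and 1 so that the Gaussian quantile stays finite.
    ranks = np.argsort(np.argsort(x)) + 1            # ranks in 1..n
    u = np.clip(ranks / (n + 1.0), eps, 1.0 - eps)   # pseudo-observations in (0, 1)
    return norm.ppf(u)                               # standard Gaussian quantile

# Example: a heavy-tailed series becomes approximately N(0, 1) after the transform.
rng = np.random.default_rng(0)
x = rng.standard_t(df=2, size=1000)                  # heavy-tailed observations
z = to_standard_gaussian(x)
print(z.mean(), z.std())                             # roughly 0 and 1
```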
Reviewer 2
This paper proposes a combined global and local method to forecast multivariate time series. From a global perspective, the authors use a Gaussian copula function to characterize the dependency between the time series, which leads to a low-rank estimation. From a local perspective, several local RNN-based time-series forecasting models are combined to estimate the sparse parameters. The motivation and working mechanism are lucid. However, there are some weaknesses.
1. The experimental comparison is not sufficient. In Table 1 and Table 2, the conventional models, such as VAR and GARCH, are weak baselines here. It would be more meaningful to validate that the proposed model, under the same settings, can improve the performance of several state-of-the-art deep learning models.
2. The number of Monte Carlo samples for the sparse estimation is unspecified, and random sampling does not seem to capture the temporal evolution of the time series. In other words, the temporal dependency and time-varying distribution of each individual time series should also be considered; otherwise, the number of samples may be a core parameter that influences the effectiveness of the model. Moreover, in the case of independent random sampling, simply using RNNs and LSTMs to fit the underlying forecasting functions is not convincing. Furthermore, excessive sampling may increase the complexity of the entire model, which is not explicitly addressed in the paper.
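For context on the sampling cost raised in point 2: with a low-rank-plus-diagonal covariance Sigma = diag(d) + V V^T, drawing Monte Carlo forecast samples costs roughly O(N r) per sample rather than O(N^2). The following is my own small sketch under illustrative shapes and names, not the paper's code:

```python
import numpy as np

def sample_low_rank_gaussian(mu, d, V, num_samples, rng):
    """Draw samples from N(mu, diag(d) + V V^T) without forming the N x N covariance.

    mu: (N,) mean, d: (N,) positive diagonal, V: (N, r) low-rank factor.
    Cost per sample is O(N * r) instead of the O(N^2) or worse of a dense covariance.
    Illustrative sketch only.
    """
    N, r = V.shape
    eps_lr = rng.standard_normal((num_samples, r))   # noise through the low-rank factor
    eps_dg = rng.standard_normal((num_samples, N))   # noise through the diagonal
    return mu + eps_lr @ V.T + eps_dg * np.sqrt(d)

rng = np.random.default_rng(0)
N, r = 2000, 10                                      # e.g. 2000 series, rank-10 covariance
mu = np.zeros(N)
d = np.full(N, 0.1)
V = rng.standard_normal((N, r)) / np.sqrt(r)
samples = sample_low_rank_gaussian(mu, d, V, num_samples=100, rng=rng)
print(samples.shape)                                 # (100, 2000)
```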
Reviewer 3
- Originality: I am not that familiar with the space of high-dimensional forecasting with modern deep learning methods.
- Quality: This paper appears technically sound, the ideas are sensible, and the experiments do a good job of empirically testing the approach. Like all empirical investigations, more could be done --- in this particular case, there are some tuning parameters that could affect performance and deserve further analysis (e.g., the time series embedding vector size, and MSE/log-likelihood or time-varying residual analysis).
- Clarity: This paper is very clearly written, and the details of the model and training algorithm are thoroughly described. Some questions below address clarity.

Other specific questions and comments:
- line 76: This was unclear to me --- are the pieces of size epsilon^{-N}, or are there that many pieces?
- line 82: "order of magnitude larger than previously reported" needs a citation and should be made more precise.
- line 85: "a principled, copula-based approach" is vague --- what principles are you referring to?
- line 105: "...LSTM is unrolled for each time series separately, but parameters are tied across time series" --- what assumptions about the data does this particular model constraint encode?
- line 124: Are all training chunks of size T + \tau? How well does look-ahead forecasting perform when the number of steps is greater than or less than \tau? Can this be made more robust by increasing or decreasing the training chunk size?
- line 135: How are these empirical CDF marginal distributions specified within the model? Do they describe the observed marginal distribution of the time series data, or do they model the distribution of the residual given the model mean?
- line 141: How does the discretization level m (here m = 100) affect speed and prediction accuracy?
- line 164: How important is the embedding e_i? How does forecasting perform as a function of the feature vector size E?
- line 175: I don't fully follow the logic --- why does this Gaussian process view enable mini-batches?
- line 186: How long does the time series need to be to fit a large LSTM that captures the dynamic covariance structure? In this synthetic example, how does the approach deteriorate (or hold up) as T shrinks?
- line 225: Regarding CRPS: it would be nice to give a short, intuitive explanation of CRPS and how it differs from other metrics such as log-likelihood or MSE. Why not report MSE and log-likelihood as well?
- Table 2: How are the CRPS-sum error bars computed?
- line 242: Besides the larger test error and the higher number of parameters, do you see other signs that the Vec-LSTM is over-fitting (e.g., train vs. test error)?
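On the CRPS question above: CRPS scores the whole forecast distribution against the realized value and reduces to absolute error for a point forecast; a standard sample-based estimate is CRPS(F, y) ≈ mean|X - y| - 0.5 * mean|X - X'| over forecast samples X, X'. A short sketch of that estimator, using my own illustrative function name and a simple biased pairwise approximation:

```python
import numpy as np

def crps_from_samples(samples, y):
    """Sample-based estimate of CRPS(F, y) = E|X - y| - 0.5 * E|X - X'|,
    where X, X' are independent draws from the forecast distribution F.

    samples: (S,) array of forecast samples for one series at one time step.
    y: the realized value. Lower is better; for a point forecast this is |x - y|.
    """
    samples = np.asarray(samples, dtype=float)
    term1 = np.mean(np.abs(samples - y))
    term2 = 0.5 * np.mean(np.abs(samples[:, None] - samples[None, :]))
    return term1 - term2

# Example: a forecast distribution centered on the truth scores better than a biased one.
rng = np.random.default_rng(0)
y = 1.0
good = rng.normal(loc=1.0, scale=0.5, size=500)
biased = rng.normal(loc=3.0, scale=0.5, size=500)
print(crps_from_samples(good, y), crps_from_samples(biased, y))  # smaller, larger
```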