NeurIPS 2020

### Review 1

Summary and Contributions: This work considers the problem of reconstructing or analyzing a multidimensional time series governed by a deterministic dynamical system when one has access only to measurements from a single dimension of the time series. A broad literature in the physics community has explored this problem in recent decades, with much work focusing on methods that reconstruct the complete time series from a so-called delay embedding. The contribution of this paper is a method to perform this reconstruction using an autoencoder framework with a regularization technique inspired by a classical method in the physics literature, false nearest neighbors. The paper shows that the regularized autoencoder architecture performs better than a plain autoencoder and three baseline methods from the physics literature.
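For context, the classical machinery this review refers to can be sketched in a few lines: a delay embedding stacks lagged copies of the scalar series, and the false-nearest-neighbors criterion flags neighbors that separate when one more lag is added. This is a rough illustrative sketch (helper names and parameters are invented here, not taken from the paper):

```python
import numpy as np

def delay_embed(x, dim, tau=1):
    """Stack lagged copies of a scalar series into a dim-dimensional embedding."""
    n = len(x) - (dim - 1) * tau
    return np.column_stack([x[i * tau : i * tau + n] for i in range(dim)])

def fnn_fraction(x, dim, tau=1, rtol=10.0):
    """Fraction of nearest neighbors in dim dimensions that separate in dim+1."""
    emb = delay_embed(x, dim, tau)
    emb1 = delay_embed(x, dim + 1, tau)
    m = len(emb1)
    emb = emb[:m]
    false_count = 0
    for i in range(m):
        d = np.linalg.norm(emb - emb[i], axis=1)
        d[i] = np.inf
        j = np.argmin(d)
        # Neighbor is "false" if the extra coordinate blows the distance up.
        if abs(emb1[i, -1] - emb1[j, -1]) / d[j] > rtol:
            false_count += 1
    return false_count / m

# Toy usage: a noisy sine unfolds at a low embedding dimension,
# so the false-neighbor fraction should drop sharply after dim 1.
t = np.linspace(0, 40 * np.pi, 2000)
x = np.sin(t) + 0.01 * np.random.default_rng(0).normal(size=t.size)
print([round(fnn_fraction(x, d, tau=10), 3) for d in (1, 2, 3)])
```

The paper's regularizer turns this hard neighbor-counting criterion into a differentiable penalty on the autoencoder's latent activations; the sketch above only shows the classical batch statistic.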

Strengths: The paper shows that bringing machine learning tools to a widely-used technique in time series analysis can bring performance benefits even in the presence of noise, giving this work the potential to impact the wide array of applications that use these methods. The method is appealingly simple, and the authors argue that it requires less parameter tuning than existing techniques. The method is shown to improve performance over comparison techniques on a broad range of datasets and metrics.

Weaknesses: Despite being an interesting observation and a novel idea, the paper has a few key weaknesses, both in the writing and in the work reported (detailed in the additional feedback below). In the writing, the technique is motivated as a general method for time series analysis, but it is limited to time series governed by an underlying deterministic dynamical system. This should be discussed in some detail, as the connection gets lost in the writing. Second, there are a number of relevant pieces of literature that the authors refer to indirectly but don't cite explicitly (details below). Most importantly, the most related reference (ref. 35, which also uses an autoencoder to learn the relevant map) is only described with half a sentence. In the work itself, the simulation examples give a nice starting point, but they are not comprehensive enough to be a convincing demonstration of the full claims of the paper. It's not clear why the baselines used for comparisons were selected, or whether they are state of the art (see point 3 in the additional feedback below). For example, as far as I can tell, the closely related ref. 35 is not used as a comparison method at all, and it's not clear why. Second, the authors sell their proposed method as requiring "essentially one" parameter to be tuned, but there are many other parameters that should be considered tunable (see point 2 in the additional feedback below). The authors must make a convincing case that the comparison methods are state of the art and that the method isn't highly sensitive to these additional parameters.

Correctness: The paper is primarily an algorithmic paper. The development seems correct.

Clarity: Overall, the paper is very well written and organized. I believe the background and problem setup would benefit from a more detailed exposition on state space methods and delay embeddings.

Relation to Prior Work: The paper presents a thorough background on delay embedding methods (with the exception of Ref. 35, which seems to be the most closely-related method). The authors mention isometric versions of Whitney's embedding theorem, but don't cite some of the recent work in this area. I would include:

- R. G. Baraniuk and M. B. Wakin, "Random Projections of Smooth Manifolds," Foundations of Computational Mathematics, vol. 9, no. 1, pp. 51-77, February 2009.
- A. Eftekhari and M. B. Wakin, "New Analysis of Manifold Embeddings and Signal Recovery from Compressive Measurements," Applied and Computational Harmonic Analysis, vol. 39, no. 1, pp. 67-109, July 2015.
- K. L. Clarkson, "Tighter Bounds for Random Projections of Manifolds," in Proc. Symp. Comput. Geom., pp. 39-48, ACM, 2008.

Similarly, the authors mention isometric (stable) versions of Takens' theorem and should discuss the relationship to:

- A. Eftekhari, H. L. Yap, M. B. Wakin, and C. J. Rozell, "Stabilizing Embedology: Geometry-Preserving Delay-Coordinate Maps," Physical Review E, vol. 97, no. 2, p. 022222, February 2018.

Reproducibility: Yes

### Review 2

Summary and Contributions: The paper presents a novel approach for reconstructing the underlying attractors of dynamical systems from partial (incomplete) observations of the system. The paper suggests to first augment the dimensionality using the method of lags/delays, and then reconstruct the data using a regularized autoencoder. Results are presented on multiple dynamical systems, both with known and unknown dynamics.

Strengths: The paper is well written and most parts are clearly described. The machine learning methodology is well presented, and the problem is relevant to the ML community. The experimental results are very extensive and promising. The discussion in the broader impact on privacy is very relevant, and raises an important concern if the underlying attractor can be inferred from partial observations.

Weaknesses: Overall, I like the methodology a lot, but as detailed in my comments, I am a bit concerned about the framing of the problem, which might create some confusion, especially from the point of view of dynamical systems. The other aspect that needs improvement is the discussion of the different types of dimensionality introduced, as detailed below.

1) Many recent methods that try to recover an underlying attractor from partial observations (observations of only some part of the system) use the method of delays (aka Takens embedding, time-lagged embedding, delay-coordinate maps). The current paper also uses this delay embedding (L76) but without making this connection clear. I find the two paragraphs L81-L110 fairly confusing (see also the comment below on dimensionality).
   a. L3-8: The current approach is presented in opposition to existing techniques that do dimension reduction. Many methods (Singular Spectrum Analysis, Nonlinear Laplacian Spectral Analysis, etc.) that rely on a partial observation of the system first embed these observations into a higher-dimensional space (via the method of delays) and then do some dimension reduction (via eigendecomposition of the covariance matrix, Laplacian method, Koopman operator, etc.). The current paper does the same thing: it first embeds the univariate observations via the method of delays (L76), and then uses an autoencoder to find a representation in a latent space (which can also be seen as a dimension reduction).
   b. L100: In practice, g is constructed from a composition of the time delays and often some eigendecomposition, not just the time delays.
2) There are many definitions of dimensionality (d, d_E, d_F, L, d_m, T), which creates some confusion:
   a. What is the connection between T and d_E? Are they both referring to the embedding dimension in the delay-coordinate embedding space, i.e., the number of delays/lags used to build the "embedded" attractor? Here, I think there might also be some confusion around the notion of "embedding space", as there are two spaces used: 1) the time-lagged embedding space, and 2) the embedding space (latent space) of the autoencoder.
   b. In the end, I suppose an important goal is to find the true dimension of the dynamical system (d) (also Sect. 5.3). What is the connection between d and L? L is chosen as the dimension of the inner layer of the autoencoder, and in the experiments, L=10. But can we just choose L as large as possible, regardless of the intrinsic dimensionality d? For unknown attractors, is there a guarantee that L > d? Is there a connection between d and d_E? L163-164: "true" embedding dimension d_E: this is confusing because I am not sure what d_E refers to exactly. If d_E is used as in L101 (equivalent to T in L76), is there a true (delay-coordinate) embedding dimension? If d_E is similar to d as in L143, what is the difference between them?
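The classical pipeline described in point 1a (delay embedding followed by linear dimension reduction, as in Singular Spectrum Analysis) can be sketched as follows; function names and sizes are illustrative, not from the paper:

```python
import numpy as np

def hankel_matrix(x, n_lags):
    """Trajectory (Hankel) matrix: each row is a length-n_lags window of x."""
    n = len(x) - n_lags + 1
    return np.stack([x[i : i + n_lags] for i in range(n)])

def ssa_components(x, n_lags, n_components):
    """SSA-style reduction: SVD of the Hankel matrix, keep leading components."""
    H = hankel_matrix(x, n_lags)
    U, s, Vt = np.linalg.svd(H, full_matrices=False)
    # Leading left singular vectors scaled by singular values give the
    # low-dimensional trajectory in the reduced space.
    return U[:, :n_components] * s[:n_components]

# A pure sine has a rank-2 Hankel matrix, so two components suffice.
x = np.sin(np.linspace(0, 20 * np.pi, 1000))
Z = ssa_components(x, n_lags=40, n_components=2)
print(Z.shape)
```

The paper's method replaces the SVD step in this pipeline with a (nonlinear) autoencoder, which is exactly the parallel the reviewer is asking the authors to make explicit.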

Correctness: The method is correct, but some claims are not well explained. The empirical methodology is mostly well detailed; however, in my opinion there are too many evaluation metrics, without sufficient details to fully understand how well they work.

Clarity: The paper is well written and most parts are easy to follow, but there are some inconsistencies, some detailed below.

Relation to Prior Work: I believe this is the main shortcoming of the paper. There are many methods with similar approaches, but these connections are not made clear in the paper (especially in Sect. 2). Is there a difference between the method of lags that the authors use and the method of delays/time-lagged embedding from the state of the art?

Reproducibility: Yes

Additional Feedback: Other comments:
1. Why "strange" in the title? Does this approach only apply to strange attractors (i.e., those with a fractal structure)? Are all the real-world datasets in the experiments generated by strange attractors, e.g., neural spiking, Old Faithful, etc.?
2. The methodology (L72-80) is presented for univariate time series (multidimensional time series are proposed as future work, L319). However, the authors use "low-dimensional" often throughout the paper, and this can be higher than one-dimensional. The suggestion would be to either: 1) extend the methodology to multivariate data, or 2) state in the abstract (and elsewhere) that this is for univariate time series, not "low-dimensional".
3. L21: incomplete observations are often referred to in dynamical systems as "partial observations". All that is available is some observation of the dynamical system, which need not be a direct observation of one of the states (as in the Lorenz example from the paper).
4. L27-28: recent -> recently
5. L81: the \theta subscript is not used again
6. L84-86: The discussion about which type of similarity between Y and \hat{Y} is sought is too brief and unclear, as this is a very important aspect.
7. L87 and L105: \mathcal{M} is the measurement operator, not the measurement. The measurement is x.
8. L101: \tau is the timestep, similar to x_i in L76 (which is the discrete analog and has its own timestep). If I understand the authors' goal correctly, d_E = T?
9. L121: [30] does not decompose the Hankel matrix, but a Laplacian matrix built from the Hankel matrix, and on top of that they do an SVD (which is not always needed). Singular Spectrum Analysis (SSA) [Ghil et al., 2002] decomposes the Hankel matrix directly.
10. Sect. 4.1: Fig. 2 is not mentioned in the text. It might be worthwhile mentioning that g is the encoder and g' the decoder.
11. Sect. 4.2: How is the train-validation-test split done? If it is done randomly, then, due to the Hankel matrix (time lags/delays), one could have an important overlap of information between the three datasets.
12. L214: If L were chosen much larger (especially for unknown attractors), what would be the impact of augmenting the dimensionality by adding many zeros?
13. Fig. 3: there is no A and B, and the similarity measures are slightly different from those described in the text. What are the values in the similarity plots? A colorbar would be very useful.
14. Fig. 4: What does a negative forecast accuracy mean? How is forecast accuracy defined exactly? For Fig. 4A, it would be interesting to understand whether using different observables, y1 vs y2 vs y3, leads to the same attractor. Caption: \tau and \eps_0 are for B and C, instead of A and B.
15. S_dim is not defined, but I assume it refers to the dimension similarity.
16. Fig. 5: What is the exact definition of the dimensionality error? What does a dimensionality error of 2.5 mean? I would expect it to be an integer, but I might be misunderstanding something.

----------------------------------------------------

Rebuttal: I would like to thank the authors for their responses. I have read the responses carefully; however, unfortunately, the distinction between the different dimensionalities used is still not clear to me. For example, it is not clear whether the purpose is to learn d (the true number of ODEs, the intrinsic dimensionality) or d_E (the number of lags in line 101). Unfortunately, the authors did not respond to some of my questions, which are still open:
- What is the connection between T and d_E? If \tau = 1, I suppose they are the same. In the response, line 54, the authors say d_E \approx d (not T).
- Why "strange" in the title? Does this approach only apply to strange attractors (i.e., those with a fractal structure)? Are all the real-world datasets in the experiments generated by strange attractors, e.g., neural spiking, Old Faithful? (previous comment)
- Is d_F needed? Which of the examples has a fractal structure?

However, overall my biggest concern is the ambiguity between d and d_E. What is the main goal: to learn d or d_E? In line 9 of the response, d_E is referred to as the "embedding dimension", which I assume refers to the embedding dimension of the latent space of the autoencoder. But line 101 of the paper shows clearly that d_E is the number of lags, i.e., the dimension of the delay-coordinate embedding, not of the latent space of the autoencoder. In the response, the authors mention that the method and code have been extended to multivariate data. Are there any results added to the paper to show how the method is applied to multivariate data? I am not so much concerned by the quality of the results as by the consistency between the method formulation and the results. If the method was modified to also apply to multivariate data, the results should contain at least one example on multivariate data.

Small comments:
- Response (b): "For unknown systems any large L can be used" - unfortunately, I don't fully agree with this statement. As the authors correctly state in their responses, the correct attractor is reconstructed "as long as L is larger than d". But for unknown systems, and for very complex systems, there is no guarantee that the chosen L is larger than d. This remains an assumption (that the chosen L is larger than d), and it would be helpful to make this clear.
- Response (a): "L sets the maximum d_E expressible by the autoencoder" (line 9) - or the maximum d (true number of ODEs)? What is the connection between d and d_E for unknown systems?
- The authors find that d \approx d_E. Is this because all previous results were on univariate data? Is the same observed for multivariate data? For example, for a 100D system, if we observe data from 99 of the ODEs, do we still need d_E = 100? Maybe this is also where the confusion between d and d_E comes from.
I do believe the paper offers a novel view of using autoencoders with delay embeddings and has strong potential. However, due to the inconsistencies, mainly related to whether the goal is to learn d or d_E, and to the exact definition of d_E (the dimension of the latent space of the autoencoder or the number of lags), I will keep my previous score. I believe these things are central to the paper and should be stated clearly to avoid any confusion.

### Review 3

Summary and Contributions: This paper presents a method for reconstructing the attractor of a dynamical system when only low-dimensional time-series measurements of the system are available. The authors reconstruct attractors from time series by training an autoencoder on the Hankel measurement matrix of the system.
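The idea summarized above (an autoencoder fit to rows of the Hankel matrix) can be illustrated with a toy, from-scratch sketch; the architecture, sizes, and training loop here are illustrative and not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)

def hankel(x, n_lags):
    """Each row is a length-n_lags delay vector of the scalar series x."""
    n = len(x) - n_lags + 1
    return np.stack([x[i : i + n_lags] for i in range(n)])

# Tiny tanh-encoder / linear-decoder autoencoder trained on Hankel rows
# of a sine wave: 20 lags compressed to 3 latent coordinates.
X = hankel(np.sin(np.linspace(0, 8 * np.pi, 400)), n_lags=20)
W1 = rng.normal(scale=0.1, size=(20, 3))   # encoder weights
W2 = rng.normal(scale=0.1, size=(3, 20))   # decoder weights
lr = 0.05
losses = []
for _ in range(500):
    Z = np.tanh(X @ W1)            # latent coordinates (candidate attractor)
    Xhat = Z @ W2                  # reconstructed delay vectors
    err = Xhat - X
    losses.append(float(np.mean(err ** 2)))
    # Full-batch gradient descent: backprop through decoder, then tanh encoder.
    dW2 = Z.T @ err / len(X)
    dZ = err @ W2.T * (1 - Z ** 2)
    dW1 = X.T @ dZ / len(X)
    W1 -= lr * dW1
    W2 -= lr * dW2
```

The paper additionally regularizes the latent layer with a false-nearest-neighbors penalty so that unneeded latent dimensions are suppressed; this bare sketch omits that regularizer.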

Strengths: Experiments show that the technique reconstructs the strange attractors of synthetic and real-world systems better than existing techniques, and that it creates consistent, predictive representations of different systems.

Weaknesses: There has been a lot of work on state space reconstruction for time series. Several studies leverage singular-value decomposition of the Hankel matrix to reconstruct the state space. There have also been works that use nonlinear kernel functions to map time series into a high-dimensional state space. This paper is a natural extension of that previous work, applying an autoencoder to the Hankel matrix to achieve state space reconstruction. The innovation of the paper is not significant.

Correctness: The claims and the proposed method in this paper are correct.

Clarity: It is well written.

Relation to Prior Work: The work is different from previous contributions. But the innovation is not significant.

Reproducibility: Yes

### Review 4

Summary and Contributions: The paper shows how to reconstruct a dynamical system (attractor) from low-dimensional measurements.

Strengths: The paper presents a method for actually doing this. Takens and Whitney describe and prove that such reconstruction is possible, but do not provide a method for achieving it. This paper does.

Weaknesses: The method is only demonstrated on three-dimensional systems. It is not clear how well the method scales.

Correctness: The methods are correct and the comparisons are valid.

Clarity: The paper is written very well.

Relation to Prior Work: Yes.

Reproducibility: Yes