Reviews: Legendre Memory Units: Continuous-Time Representation in Recurrent Neural Networks

Reviewer 1

Originality: The use of Legendre polynomials seems rather creative, and it was certainly important to define RNNs with good models of coupled linear units.

Quality: The set of benchmarks is well chosen to cover a broad range of the qualities that RNNs require. One non-artificial task would have been a plus, though. What would have been even more important is to support the theory by controlling for the importance of the initialization of the matrices A and B. What if A were initialized with a clever diagonal (for instance, the diagonal of A_bar)? As the architecture is already rather close to that of the NRU, one may wonder whether the architecture itself, rather than the initialization, is doing most of the job. On a similar topic, the authors say that A and B may or may not be learned. It would be useful to know for which tasks the matrices A, B and W_m are learned.

Clarity: The paper is very well written. Only the section on spiking neural networks could be described a little better. It is not clear to me what challenge is discussed in the neural precision section. What is the "error" mentioned on line 268? I understood that this spiking neural network models only the linear part of the LMU and not the nonlinear units, but I remain unsure because of line 284. Why is this linear scaling (l. 285 to 289) particularly good? Good in comparison to what? I would assume that dedicated digital hardware transmitting floating-point numbers, but exploiting this clever connectivity, would be much more accurate than the noisy Poisson network and would avoid using p neurons per memory unit.

---- Second Review ----

Thanks to the authors for giving numerical results with other choices of A_bar. I think the paper would greatly benefit from the insertion of one or two sentences about that, to show that the mathematical depth of the paper is not decoration but leads to nontrivial improvements on machine learning benchmarks.

However, I also find it hard to believe that the diagonal of A_bar alone leads to chance level on permuted sequential MNIST. I am not satisfied with the authors' justification that this is due to a bad parameter initialization. I think it deserves a bit more analysis, either to give a more convincing reason or to find the trick that makes the diagonal A_bar work, at least a bit, in this setup. If it achieves only chance level, this naive change of A_bar looks ill-defined. I think that after some rewriting of the spiking hardware section, to clarify what is done and the relevance of that approach, the paper will become an excellent paper.
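For concreteness, the diagonal control discussed in this review can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the continuous-time (A, B) definition of the LMU memory and a simple Euler discretization, and the helper names (`lmu_matrices`, `euler_discretize`, `diagonal_only`, `memory_step`) are hypothetical. The exact setup of the authors' ablation is not described in the reviews.

```python
def lmu_matrices(d):
    """Continuous-time (A, B) for an order-d memory, before scaling by 1/theta.

    Assumed definition: a_ij = (2i+1) * (-1 if i < j else (-1)**(i-j+1)),
    b_i = (2i+1) * (-1)**i, with i, j in [0, d-1].
    """
    A = [[0.0] * d for _ in range(d)]
    B = [0.0] * d
    for i in range(d):
        B[i] = (2 * i + 1) * (-1.0) ** i
        for j in range(d):
            sign = -1.0 if i < j else (-1.0) ** (i - j + 1)
            A[i][j] = (2 * i + 1) * sign
    return A, B


def euler_discretize(A, B, dt, theta):
    """Euler step: A_bar = (dt/theta) * A + I, B_bar = (dt/theta) * B."""
    d = len(A)
    s = dt / theta
    A_bar = [[s * A[i][j] + (1.0 if i == j else 0.0) for j in range(d)]
             for i in range(d)]
    B_bar = [s * b for b in B]
    return A_bar, B_bar


def diagonal_only(M):
    """Control condition: keep the diagonal of M and zero every coupling term."""
    d = len(M)
    return [[M[i][j] if i == j else 0.0 for j in range(d)] for i in range(d)]


def memory_step(A_bar, B_bar, m, u):
    """One linear memory update: m_t = A_bar m_{t-1} + B_bar u_t."""
    d = len(m)
    return [sum(A_bar[i][j] * m[j] for j in range(d)) + B_bar[i] * u
            for i in range(d)]


if __name__ == "__main__":
    # Illustrative values only (d, dt and theta are arbitrary choices here).
    A, B = lmu_matrices(4)
    A_bar, B_bar = euler_discretize(A, B, dt=1.0, theta=50.0)
    A_diag = diagonal_only(A_bar)
    m_full = m_diag = [0.0] * 4
    for u in [1.0, 0.0, 0.0, 0.0, 0.0]:  # unit impulse followed by silence
        m_full = memory_step(A_bar, B_bar, m_full, u)
        m_diag = memory_step(A_diag, B_bar, m_diag, u)
    print("coupled memory: ", m_full)
    print("diagonal memory:", m_diag)
```

With the diagonal-only matrix, each memory unit reduces to an independent leaky integrator, so comparing the two conditions isolates the contribution of the coupling between units from that of the surrounding architecture.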

Reviewer 2

This paper clearly differentiates its novelty from the work on which it builds. The quality of the paper is good. The paper is mostly clear, but the abstract could be made clearer by focusing more at the beginning on the impact of the LMU than on how it works. Also, throughout the paper I was wondering whether this work was all simulation or whether it was run on real hardware. The paper mentions Braindrop and Loihi multiple times, and Section 5 suggests a hardware implementation too; but in the end, I decided everything was simulated because you pointed to code on GitHub. However, you did not say whether that code needs special hardware to run. So which is correct? It would help to clarify that early in the paper. As for significance, this work will be of interest to the neural chip researchers at NeurIPS. I definitely enjoyed that the authors were able to use references from 1892 and 1782! :-)

Reviewer 3

I read the other reviews, and the author feedback addresses the questions and concerns raised, in particular about the dimensionality of u and the comparison to phased LSTM. I think this paper addresses a relevant problem (limited memory horizon), presents a novel approach (LMU), and nicely analyses this approach theoretically and empirically. I raise my score to 8.

Originality: Even though the proposed approach is based on the work "Improving Spiking Dynamical Networks: Accurate Delays, Higher-Order Synapses, and Time Cells", its transfer to deep neural networks is new and significant. In the related work, "Phased LSTM: Accelerating Recurrent Network Training for Long or Event-based Sequences" could be discussed and also used as a baseline.

Quality: The proposed approach is well described and seems easy to implement from the description given in the text. The theoretical formulation provided for the continuous-time case yields a valuable interpretation and intuition, and the experiments are well tailored to showcase the strengths of the LMU module, in particular the increased memory horizon. Sections 4 and 5, which describe the characteristics of the LMU and a spiking variant, are also helpful and provide a summary of interesting features.

Clarity: The work is well written and structured. Figures and tables are of proper quality. It seems that one could easily reproduce the results with the provided code and descriptions.

Paper ID:	9024
Title:	Legendre Memory Units: Continuous-Time Representation in Recurrent Neural Networks
