Reviews: Kernel-Based Approaches for Sequence Modeling: Connections to Neural Methods

Reviewer 1

This submission presents a range of interesting connections between deep learning and kernel learning. However, I find the presentation rather unusual. In particular, the factorisation of the final-layer matrix U into A and E is not merely "convenient", as the authors put it, but critical to their exposition. In addition, the choice of e_i in equation (2) and of \tilde{z}_i needs to be properly argued for, or at least discussed. Furthermore, the lack of an overview that succinctly lays out the overall approach for linking deep and kernel learning impedes the flow; a diagram or something of this sort would have been immensely helpful. Given how much notation is used, a diagram or summary of the notation would likewise help the reader absorb it. Why is the memory cell, though a vector, capitalised (a convention normally reserved for matrices)? The distinction between the "tilded" and regular versions of variables is an important aspect of the work and should be properly introduced. Overall, I believe the submission is sufficiently original, is lacking in some respects regarding quality and clarity, and is sufficiently significant to a wide audience in the deep and kernel learning communities. After reading the authors' response and discussing with the other reviewers, I hope that the authors will not only add one diagram but also make other changes that make this very dense submission easier to follow and understand. I am therefore adding +1 to my previous recommendation.

Reviewer 2

Sections 3 and 4 feel mostly independent of kernels; they read as a train-of-thought discussion on how to motivate and build useful RNN architectures. They fail as a clear description of the actual proposed models: it takes quite a bit of re-reading to work out what the RKM-LSTM is, for example. For the language modelling experiments, I feel Table 3 should acknowledge what is state-of-the-art for WikiText-2 (WT2) and PTB, as there is an insinuation that the proposed models are SOTA (which they are not). For example, [24] is cited, but the numbers in that paper are actually better than those presented in Table 3: for [24], PTB is 60.0 valid, 57.3 test, and WT2 is 68.6 valid, 65.8 test. There are also more recent variants of the AWD-LSTM, e.g. "Breaking the Softmax Bottleneck: A High-Rank RNN Language Model" by Yang et al., which achieves 63.88 valid and 61.45 test on WT2. It is worth noting what the state of the art is in results tables if one is to claim that the proposed model is SOTA. For the document classification tasks, the gains over an LSTM are extremely minimal. I strongly suspect a transformer would obliterate these models on these tasks nowadays, and given the lack of appropriate citations for the language modelling task, I am not 100% sure that there are no better-performing works on these tasks. The paper could mention Quasi-Recurrent Neural Networks, which appear to be very similar to the n-gram LSTM. I think the general clarity of the paper could be improved: the introduction, Section 2, and the results sections were quite nicely written, but I found Sections 3, 4, and 5 dry and a little arbitrary. As said, it is difficult even to extract from Section 4 exactly what the proposed models are. Unfortunately, I do not find the connections between kernels and recurrent neural networks very enlightening, but I think many people do, and this could motivate new work. In sum, I would consider accepting this paper if it were rewritten.
===== Thank you for your response; in light of it, I have changed my review to an accept.

Reviewer 3

This manuscript provides an explanation of RNNs and CNNs from the kernel perspective. It defines a hidden variable h_t that depends on previous time points: h_t depends on the current observation x_t and the previous state h_{t-1} through an unknown function f. The predicted variable y_t depends on h_t through the product of a time-invariant factor loading matrix A and a dynamic factor matrix E. The authors then assume that h_t lives in a Hilbert space, as do the rows of the dynamic factor matrix E. This assumption makes it possible to represent e_i in the same parametric form as h_t. As a result, we can define a kernel for h_t, so that computations can be carried out in the space of x_t and h_{t-1}. As h_{t-1} and h_t live in the same space, the kernel is calculated recursively, as shown in equation (6). If we repeat the calculation long enough, q_{\theta}(C_{t-N}) becomes constant. In this recursive process, it is shown that the kernel can be calculated in the space of x. Based on this framework, the manuscript makes some extensions and interprets the LSTM and CNN as special cases from the recurrent kernel machine perspective. The proposed method achieves results comparable with current state-of-the-art methods in several experiments and improves performance on an LFP task. I find this manuscript enjoyable in general. I have two concerns: 1. Is it reasonable to assume that e_i lives in the same Hilbert space? A proof of existence might be helpful. 2. It is said that q_{\theta}(C_{t-N}) can be seen as a vector of biases. What are the conditions that guarantee its convergence?
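
To make my reading of the factorised readout concrete, here is a minimal sketch. It assumes a plain linear readout y_t = A E h_t with a Euclidean inner product; the dimensions, the random values, and the helper names (matvec, matmul, dot) are illustrative assumptions of mine and are not taken from the paper. It only checks that each component of E h_t is an inner product between a row e_i and h_t, which is the quantity I understand the paper to reinterpret as a kernel evaluation.

```python
import random

def dot(u, v):
    """Euclidean inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [dot(row, v) for row in M]

def matmul(A, B):
    """Matrix-matrix product for matrices stored as lists of rows."""
    cols = list(zip(*B))
    return [[dot(row, col) for col in cols] for row in A]

random.seed(0)
V, J, D = 4, 3, 5  # hypothetical output, factor, and hidden dimensions

A = [[random.gauss(0, 1) for _ in range(J)] for _ in range(V)]  # time-invariant loadings
E = [[random.gauss(0, 1) for _ in range(D)] for _ in range(J)]  # rows play the role of e_i
h_t = [random.gauss(0, 1) for _ in range(D)]                    # hidden state at time t

# Unfactorised readout: y_t = U h_t with U = A E.
U = matmul(A, E)
y_direct = matvec(U, h_t)

# Factorised readout: first k_i = <e_i, h_t> for each row e_i, then y_t = A k.
k = [dot(e_i, h_t) for e_i in E]
y_factored = matvec(A, k)

# The two readouts agree up to floating-point rounding.
max_gap = max(abs(a - b) for a, b in zip(y_direct, y_factored))
```

With this plain inner product the kernel is linear. My first concern above is whether the same step remains justified once e_i is required to share the parametric form of h_t.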

Paper ID:	1875