{"title": "Kernel-Based Approaches for Sequence Modeling: Connections to Neural Methods", "book": "Advances in Neural Information Processing Systems", "page_first": 3392, "page_last": 3403, "abstract": "We investigate time-dependent data analysis from the perspective of recurrent kernel machines, from which models with hidden units and gated memory cells arise naturally. By considering dynamic gating of the memory cell, a model closely related to the long short-term memory (LSTM) recurrent neural network is derived. Extending this setup to $n$-gram filters, the convolutional neural network (CNN), Gated CNN, and recurrent additive network (RAN) are also recovered as special cases. Our analysis provides a new perspective on the LSTM, while also extending it to $n$-gram convolutional filters. Experiments are performed on natural language processing tasks and on analysis of local field potentials (neuroscience). We demonstrate that the variants we derive from kernels perform on par or even better than traditional neural methods. For the neuroscience application, the new models demonstrate significant improvements relative to the prior state of the art.", "full_text": "Kernel-Based Approaches for Sequence Modeling:\n\nConnections to Neural Methods\n\nKevin J Liang\u2217 Guoyin Wang\u2217 Yitong Li Ricardo Henao Lawrence Carin\n\nDepartment of Electrical and Computer Engineering\n\nDuke University\n\n{kevin.liang, guoyin.wang, yitong.li, ricardo.henao, lcarin}@duke.edu\n\nAbstract\n\nWe investigate time-dependent data analysis from the perspective of recurrent\nkernel machines, from which models with hidden units and gated memory cells\narise naturally. By considering dynamic gating of the memory cell, a model closely\nrelated to the long short-term memory (LSTM) recurrent neural network is derived.\nExtending this setup to n-gram \ufb01lters, the convolutional neural network (CNN),\nGated CNN, and recurrent additive network (RAN) are also recovered as special\ncases. 
Our analysis provides a new perspective on the LSTM, while also extending it to n-gram convolutional filters. Experiments^1 are performed on natural language processing tasks and on analysis of local field potentials (neuroscience). We demonstrate that the variants we derive from kernels perform on par with or even better than traditional neural methods. For the neuroscience application, the new models demonstrate significant improvements relative to the prior state of the art.

1 Introduction

There has been significant recent effort directed at connecting deep learning to kernel machines [1, 5, 23, 36]. Specifically, it has been recognized that a deep neural network may be viewed as constituting a feature mapping x -> phi_theta(x), for input data x in R^m. The nonlinear function phi_theta(x), with model parameters theta, has an output that corresponds to a d-dimensional feature vector; phi_theta(x) may be viewed as a mapping of x to a Hilbert space H, where H subset R^d. The final layer of deep neural networks typically corresponds to an inner product omega^T phi_theta(x), with weight vector omega in H; for a vector output, there are multiple omega, with omega_i^T phi_theta(x) defining the i-th component of the output. For example, in a deep convolutional neural network (CNN) [19], phi_theta(x) is a function defined by the multiple convolutional layers, the output of which is a d-dimensional feature map; omega represents the fully-connected layer that imposes inner products on the feature map.
Learning omega and theta, i.e., the cumulative neural network parameters, may be interpreted as learning within a reproducing kernel Hilbert space (RKHS) [4], with omega the function in H; phi_theta(x) represents the mapping from the space of the input x to H, with associated kernel k_theta(x, x') = phi_theta(x)^T phi_theta(x'), where x' is another input.

Insights garnered about neural networks from the perspective of kernel machines provide valuable theoretical underpinnings, helping to explain why such models work well in practice. As an example, the RKHS perspective helps explain invariance and stability of deep models, as a consequence of the smoothness properties of an appropriate RKHS to variations in the input x [5, 23]. Further, such insights provide the opportunity for the development of new models.

Most prior research on connecting neural networks to kernel machines has assumed a single input x, e.g., image analysis in the context of a CNN [1, 5, 23]. However, the recurrent neural network (RNN) has also received renewed interest for analysis of sequential data. For example, long short-term memory (LSTM) [15, 13] and the gated recurrent unit (GRU) [9] have become fundamental elements in many natural language processing (NLP) pipelines [16, 9, 12]. In this context, a sequence of data vectors (. . . , x_{t-1}, x_t, x_{t+1}, . . . ) is analyzed, and the aforementioned single-input models are inappropriate.

In this paper, we extend to recurrent neural networks (RNNs) the concept of analyzing neural networks from the perspective of kernel machines.

* These authors contributed equally to this work.
1 Implementations can be found at https://github.com/kevinjliang/kernels2rnns.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.
Leveraging recent work on recurrent kernel\nmachines (RKMs) for sequential data [14], we make new connections between RKMs and RNNs,\nshowing how RNNs may be constructed in terms of recurrent kernel machines, using simple \ufb01lters.\nWe demonstrate that these recurrent kernel machines are composed of a memory cell that is updated\nsequentially as new data come in, as well as in terms of a (distinct) hidden unit. A recurrent model\nthat employs a memory cell and a hidden unit evokes ideas from the LSTM. However, within the\nrecurrent kernel machine representation of a basic RNN, the rate at which memory fades with time\nis \ufb01xed. To impose adaptivity within the recurrent kernel machine, we introduce adaptive gating\nelements on the updated and prior components of the memory cell, and we also impose a gating\nnetwork on the output of the model. We demonstrate that the result of this re\ufb01nement of the recurrent\nkernel machine is a model closely related to the LSTM, providing new insights on the LSTM and its\nconnection to kernel machines.\nContinuing with this framework, we also introduce new concepts to models of the LSTM type. The\nre\ufb01ned LSTM framework may be viewed as convolving learned \ufb01lters across the input sequence and\nusing the convolutional output to constitute the time-dependent memory cell. Multiple \ufb01lters, possibly\nof different temporal lengths, can be utilized, like in the CNN. One recovers the CNN [18, 37, 17]\nand Gated CNN [10] models of sequential data as special cases, by turning off elements of the new\nLSTM setup. 
From another perspective, we demonstrate that the new LSTM-like model may be viewed as introducing gated memory cells and feedback to a CNN model of sequential data.

In addition to developing the aforementioned models for sequential data, we demonstrate them in an extensive set of experiments, focusing on applications in natural language processing (NLP) and in analysis of multi-channel, time-dependent local field potential (LFP) recordings from mouse brains. Concerning the latter, we demonstrate marked improvements in performance of the proposed methods relative to recently-developed alternative approaches [22].

2 Recurrent Kernel Network

Consider a sequence of vectors (. . . , x_{t-1}, x_t, x_{t+1}, . . . ), with x_t in R^m. For a language model, x_t is the embedding vector for the t-th word w_t in a sequence of words. To model this sequence, we introduce y_t = U h_t, with the recurrent hidden variable satisfying

h_t = f(W^(x) x_t + W^(h) h_{t-1} + b)    (1)

where h_t in R^d, U in R^{V x d}, W^(x) in R^{d x m}, W^(h) in R^{d x d}, and b in R^d. In the context of a language model, the vector y_t in R^V may be fed into a nonlinear function to predict the next word w_{t+1} in the sequence. Specifically, the probability that w_{t+1} corresponds to i in {1, . . . , V} in a vocabulary of V words is defined by element i of vector Softmax(y_t + beta), with bias beta in R^V. In classification, such as the LFP-analysis example in Section 6, V is the number of classes under consideration.

We constitute the factorization U = AE, where A in R^{V x j} and E in R^{j x d}, often with j << V. Hence, we may write y_t = A h'_t, with h'_t = E h_t; the columns of A may be viewed as time-invariant factor loadings, and h'_t represents a vector of dynamic factor scores.
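The recurrence (1) with the factorized output y_t = A E h_t can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's implementation: the tanh choice for f, the toy dimensions, and all weight values are assumptions.

```python
import math

# Sketch of the recurrent model (1) with factorized output y_t = A h'_t,
# h'_t = E h_t.  f = tanh and all weights/dimensions are illustrative.

def mat_vec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def rnn_step(x_t, h_prev, W_x, W_h, b):
    # h_t = f(W^(x) x_t + W^(h) h_{t-1} + b)
    pre = [a + c + d for a, c, d in zip(mat_vec(W_x, x_t), mat_vec(W_h, h_prev), b)]
    return [math.tanh(p) for p in pre]

# toy dimensions: m = 2 inputs, d = 3 hidden, j = 2 factors, V = 4 outputs
W_x = [[0.1, -0.2], [0.0, 0.3], [0.2, 0.1]]
W_h = [[0.1, 0.0, 0.0], [0.0, 0.1, 0.0], [0.0, 0.0, 0.1]]
b = [0.0, 0.0, 0.0]
E = [[0.5, -0.5, 0.2], [0.1, 0.4, -0.3]]               # j x d
A = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-0.5, 0.5]]  # V x j

h = [0.0, 0.0, 0.0]
for x_t in [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]:
    h = rnn_step(x_t, h, W_x, W_h, b)
    h_factors = mat_vec(E, h)    # h'_t: dynamic factor scores
    y_t = mat_vec(A, h_factors)  # y_t = A h'_t; Softmax(y_t + beta) would give word probs
print(len(y_t))  # V-dimensional output
```

In a language model the final `y_t` would be passed through the softmax with bias beta, as described above.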
Let z_t = [x_t, h_{t-1}] represent a column vector corresponding to the concatenation of x_t and h_{t-1}; then h_t = f(W^(z) z_t + b), where W^(z) = [W^(x), W^(h)] in R^{d x (d+m)}. Computation of E h_t corresponds to inner products of the rows of E with the vector h_t. Let e_i in R^d be a column vector, with elements corresponding to row i in {1, . . . , j} of E. Then component i of h'_t is

h'_{i,t} = e_i^T h_t = e_i^T f(W^(z) z_t + b)    (2)

We view f(W^(z) z_t + b) as mapping z_t into a RKHS H, and vector e_i is also assumed to reside within H. We consequently assume

e_i = f(W^(z) z~_i + b)    (3)

where z~_i = [x~_i, h~_0]. Note that here h~_0 also depends on index i, which we omit for simplicity; as discussed below, x~_i will play the primary role when performing computations.

e_i^T h_t = e_i^T f(W^(z) z_t + b) = f(W^(z) z~_i + b)^T f(W^(z) z_t + b) = k_theta(z~_i, z_t)    (4)

where k_theta(z~_i, z_t) = h(z~_i)^T h(z_t) is a Mercer kernel [29].

Figure 1: a) A traditional recurrent neural network (RNN), with the factorization U = AE. b) A recurrent kernel machine (RKM), with an implicit hidden state and recurrence through recursion. c) The recurrent kernel machine expressed in terms of a memory cell.

Particular kernel choices correspond to different functions f(W^(z) z_t + b), and theta is meant to represent kernel parameters that may be adjusted. We initially focus on kernels of the form k_theta(z~, z_t) = q_theta(z~^T z_t) = h~_1^T h_t (see footnote 2), where q_theta(.) is a function of parameters theta, h_t = h(z_t), and h~_1 is the implicit latent vector associated with the inner product, i.e., h~_1 = f(W^(x) x~ + W^(h) h~_0 + b). As discussed below, we will not need to explicitly evaluate h_t or h~_1 to evaluate the kernel, taking advantage of the recursive relationship in (1). In fact, depending on the choice of q_theta(.), the hidden vectors may even be infinite-dimensional. However, because of the relationship q_theta(z~^T z_t) = h~_1^T h_t, for rigorous analysis q_theta(.) should satisfy Mercer's condition [11, 29].

The vectors (h~_1, h~_0, h~_{-1}, . . . ) are assumed to satisfy the same recurrence setup as (1), with each vector in the associated sequence (x~_t, x~_{t-1}, . . . ) assumed to be the same x~_i at each time, i.e., associated with e_i, (x~_t, x~_{t-1}, . . . ) -> (x~_i, x~_i, . . . ). Stepping backwards in time three steps, for example, one may show

k_theta(z~_i, z_t) = q_theta[x~_i^T x_t + q_theta[x~_i^T x_{t-1} + q_theta[x~_i^T x_{t-2} + q_theta[x~_i^T x_{t-3} + h~_{-4}^T h_{t-4}]]]]    (5)

The inner product h~_{-4}^T h_{t-4} encapsulates contributions for all times further backwards, and for a sequence of length N, h~_{-N}^T h_{t-N} plays a role analogous to a bias.
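The nested evaluation in (5) can be sketched for a single filter x~_i. This is an illustrative sketch: the choice q_theta(a) = sigma^2 * a (a contractive linear map) and all filter/input values are assumptions, chosen only to make the fading-memory behavior visible.

```python
# Scalar sketch of the nested kernel (5) for one filter x~_i, iterating the
# recursion c_s = x~^T x_s + q_theta(c_{s-1}) from oldest to newest input.
# q_theta(a) = SIGMA2 * a with SIGMA2 < 1 is an assumed contractive choice.

SIGMA2 = 0.25

def q(a):
    return SIGMA2 * a

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def kernel(x_tilde, xs):
    c = 0.0  # the initial bias q_theta(c_{t-N}) is omitted, as in the text
    for x_s in xs:  # oldest to newest
        c = dot(x_tilde, x_s) + q(c)
    return q(c)  # k_theta(z~_i, z_t) = q_theta(c_t)

x_tilde = [0.6, -0.8]
recent = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5], [0.0, 0.5], [1.0, 0.0]]
k_short = kernel(x_tilde, recent)
# prepending an element far in the past barely changes the kernel value,
# because each application of q_theta shrinks its contribution by SIGMA2:
k_long = kernel(x_tilde, [[5.0, 5.0]] + recent)
print(abs(k_long - k_short))
```

The printed difference is tiny relative to the kernel value itself, illustrating why the h~_{-N}^T h_{t-N} term acts only as a bias for large N.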
As discussed below, for stability the repeated application of q_theta(.) yields diminishing (fading) contributions from terms earlier in time, and therefore for large N the impact of h~_{-N}^T h_{t-N} on k_theta(z~_i, z_t) is small.

The overall model may be expressed as

h'_t = q_theta(c_t) ,   c_t = c~_t + q_theta(c_{t-1}) ,   c~_t = X~ x_t    (6)

where c_t in R^j is a memory cell at time t, row i of X~ corresponds to x~_i^T, and q_theta(c_t) operates pointwise on the components of c_t (see Figure 1). At the start of the sequence of length N, q_theta(c_{t-N}) may be seen as a vector of biases, effectively corresponding to h~_{-N}^T h_{t-N}; we henceforth omit discussion of this initial bias for notational simplicity, and because for sufficiently large N its impact on h'_t is small.

Note that via the recursive process by which c_t is evaluated in (6), the kernel evaluations reflected by q_theta(c_t) are defined entirely by the elements of the sequence (c~_t, c~_{t-1}, c~_{t-2}, . . . ). Let c~_{i,t} represent the i-th component in vector c~_t, and define x_{<=t} = (x_t, x_{t-1}, x_{t-2}, . . . ). Then the sequence (c~_{i,t}, c~_{i,t-1}, c~_{i,t-2}, . . . ) is specified by convolving in time x~_i with x_{<=t}, denoted x~_i * x_{<=t}. Hence, the j components of the sequence (c~_t, c~_{t-1}, c~_{t-2}, . . . ) are completely specified by convolving x_{<=t} with each of the j filters, x~_i, i in {1, . . . , j}, i.e., taking an inner product of x~_i with the vector in x_{<=t} at each time point.

In (4) we represented h'_{i,t} = q_theta(c_{i,t}) as h'_{i,t} = k_theta(z~_i, z_t); now, because of the recursive form of the model in (1), and because of the assumption k_theta(z~_i, z_t) = q_theta(z~_i^T z_t), we have demonstrated that we may express the kernel equivalently as k_theta(x~_i * x_{<=t}), to underscore that it is defined entirely by the elements at the output of the convolution x~_i * x_{<=t}. Hence, we may express component i of h'_t as h'_{i,t} = k_theta(x~_i * x_{<=t}).

Component l in {1, . . . , V} of y_t = A h'_t may be expressed

y_{l,t} = sum_{i=1}^{j} A_{l,i} k_theta(x~_i * x_{<=t})    (7)

where A_{l,i} represents component (l, i) of matrix A. Considering (7), the connection of an RNN to an RKHS is clear, as made explicit by the kernel k_theta(x~_i * x_{<=t}). The RKHS is manifested for the final output y_t, with the hidden h_t now absorbed within the kernel, via the inner product (4). The feedback imposed via latent vector h_t is constituted via update of the memory cell c_t = c~_t + q_theta(c_{t-1}) used to evaluate the kernel.

Rather than evaluating y_t as in (7), it will prove convenient to return to (6).

2 One may also design recurrent kernels of the form k_theta(z~, z_t) = q_theta(||z~ - z_t||_2^2) [14], as for a Gaussian kernel, but if vectors x_t and filters x~_i are normalized (e.g., x_t^T x_t = x~_i^T x~_i = 1), then q_theta(||z~ - z_t||_2^2) reduces to q_theta(z~^T z_t).
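The memory-cell form (6) and output (7) can be sketched directly, and the claim that h'_{i,t} depends only on the convolution outputs x~_i * x_{<=t} can be checked numerically. All values below, and the linear choice q_theta(a) = 0.5*a, are illustrative assumptions.

```python
# Sketch of (6)-(7): the vector recursion for the memory cell c_t, compared
# against a per-filter scalar recursion run over the precomputed inner
# products x~_i^T x_s (the convolution outputs).  They agree exactly.

def q(a):
    return 0.5 * a  # assumed linear-kernel choice

def dot(u, v):
    return sum(p * r for p, r in zip(u, v))

X_tilde = [[0.6, -0.8], [0.3, 0.4]]        # j = 2 filters, rows x~_i
A = [[1.0, -1.0], [0.5, 0.5], [2.0, 0.0]]  # V = 3 output loadings
xs = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [1.0, 1.0]]

# vector recursion: c_t = X~ x_t + q_theta(c_{t-1}), h'_t = q_theta(c_t)
c = [0.0, 0.0]
for x_t in xs:
    c = [dot(row, x_t) + q(ci) for row, ci in zip(X_tilde, c)]
h_prime = [q(ci) for ci in c]
y_t = [dot(row, h_prime) for row in A]  # (7): y_{l,t} = sum_i A_{l,i} k_theta(x~_i * x_<=t)

# per-filter recursion driven only by the convolution outputs
h_from_conv = []
for x_i in X_tilde:
    conv = [dot(x_i, x_s) for x_s in xs]  # the sequence x~_i * x_{<=t}
    ci = 0.0
    for v in conv:
        ci = v + q(ci)
    h_from_conv.append(q(ci))
print(h_prime, h_from_conv)
```

The two computations coincide, which is exactly the statement that the kernel evaluations are defined entirely by the elements of (c~_t, c~_{t-1}, . . . ).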
Specifically, we may consider modifying (6) by injecting further feedback via h'_t, augmenting (6) as

h'_t = q_theta(c_t) ,   c_t = c~_t + q_theta(c_{t-1}) ,   c~_t = X~ x_t + H~ h'_{t-1}    (8)

where H~ in R^{j x j}, and recalling y_t = A h'_t (see Figure 2a for illustration). In (8) the input to the kernel is dependent on the input elements (x_t, x_{t-1}, . . . ) and is now also a function of the kernel outputs at the previous time, via h'_{t-1}. However, note that h'_t is still specified entirely by the elements of x~_i * x_{<=t}, for i in {1, . . . , j}.

3 Choice of Recurrent Kernels & Introduction of Gating Networks

3.1 Fixed kernel parameters & time-invariant memory-cell gating

The function q_theta(.) discussed above may take several forms, the simplest of which is a linear kernel, with which (8) takes the form

h'_t = c_t ,   c_t = sigma_i^2 c~_t + sigma_f^2 c_{t-1} ,   c~_t = X~ x_t + H~ h'_{t-1}    (9)

where sigma_i^2 and sigma_f^2 (using analogous notation from [14]) are scalars, with sigma_f^2 < 1 for stability. The scalars sigma_i^2 and sigma_f^2 may be viewed as static (i.e., time-invariant) gating elements, with sigma_i^2 controlling the weighting on the new input element to the memory cell, and sigma_f^2 controlling how much of the prior memory unit is retained; given sigma_f^2 < 1, this means information from previous time steps tends to fade away and over time is largely forgotten. However, such a kernel leads to time-invariant decay of memory: the contribution of c~_{t-N} from N steps before to the current memory c_t is (sigma_i sigma_f^N)^2 c~_{t-N}, meaning that it decays at a constant exponential rate. Because the information contained at each time step can vary, this can be problematic.
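The constant-rate decay of the linear-kernel memory cell (9) can be checked with an impulse: a unit contribution placed N steps back reaches the current cell with weight sigma_i^2 sigma_f^{2N}, no matter what the surrounding inputs contain. The gate values below are illustrative assumptions.

```python
# Sketch of time-invariant memory decay under (9): c_t = sigma_i^2 c~_t +
# sigma_f^2 c_{t-1}.  A unit impulse in c~ followed by N zero steps arrives
# with weight sigma_i^2 * (sigma_f^2)^N, a fixed exponential decay.

SIG_I2, SIG_F2 = 0.9, 0.5  # assumed static gates, SIG_F2 < 1 for stability

def run(cs_tilde):
    c = 0.0
    for ct in cs_tilde:
        c = SIG_I2 * ct + SIG_F2 * c
    return c

N = 4
surviving = run([1.0] + [0.0] * N)       # impulse, then N empty steps
predicted = SIG_I2 * SIG_F2 ** N          # closed-form decay weight
print(surviving, predicted)
```

Because the decay rate cannot adapt to the content of each time step, this motivates the dynamic (input-dependent) gates introduced next.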
This suggests augmenting the model with time-varying gating weights, with memory-component dependence on the weights, which we consider below.

3.2 Dynamic gating networks & LSTM-like model

Recent work has shown that dynamic gating can be seen as making a recurrent network quasi-invariant to temporal warpings [30]. Motivated by the form of the model in (9), it is natural to impose dynamic versions of sigma_i^2 and sigma_f^2; we also introduce dynamic gating at the output of the hidden vector. This yields the model:

h'_t = o_t . c_t ,   c_t = eta_t . c~_t + f_t . c_{t-1} ,   c~_t = W_c z'_t    (10)
o_t = sigma(W_o z'_t + b_o) ,   eta_t = sigma(W_eta z'_t + b_eta) ,   f_t = sigma(W_f z'_t + b_f)    (11)

where z'_t = [x_t, h'_{t-1}], and W_c encapsulates X~ and H~. In (10)-(11) the symbol . represents a pointwise vector product (Hadamard); W_c, W_o, W_eta and W_f are weight matrices; b_o, b_eta and b_f are bias vectors; and sigma(alpha) = 1/(1 + exp(-alpha)). In (10), eta_t and f_t play dynamic counterparts to sigma_i^2 and sigma_f^2, respectively. Further, o_t, eta_t and f_t are vectors, constituting vector-component-dependent gating. Note that starting from a recurrent kernel machine, we have thus derived a model closely resembling the LSTM. We call this model RKM-LSTM (see Figure 2).

Figure 2: a) Recurrent kernel machine, with feedback, as defined in (8). b) Making a linear kernel assumption and adding input, forget, and output gating, this model becomes the RKM-LSTM.

Concerning the update of the hidden state, h'_t = o_t . c_t in (10), one may also consider appending a hyperbolic-tangent tanh nonlinearity: h'_t = o_t . tanh(c_t). However, recent research has suggested not using such a nonlinearity [20, 10, 7], and this is a natural consequence of our recurrent kernel analysis. Using h'_t = o_t . tanh(c_t), the model in (10) and (11) is in the form of the LSTM, except without the nonlinearity imposed on the memory cell c~_t, while in the LSTM a tanh nonlinearity (and biases) is employed when updating the memory cell [15, 13], i.e., for the LSTM c~_t = tanh(W_c z'_t + b_c). If o_t = 1 for all time t (no output gating network), and if c~_t = W_c x_t (no dependence on h'_{t-1} for update of the memory cell), this model reduces to the recurrent additive network (RAN) [20].

While separate gates eta_t and f_t were constituted in (10) and (11) to operate on the new and prior composition of the memory cell, one may also consider a simpler model with memory cell updated c_t = (1 - f_t) . c~_t + f_t . c_{t-1}; this was referred to as having a Coupled Input and Forget Gate (CIFG) in [13]. In such a model, the decisions of what to add to the memory cell and what to forget are made jointly, obviating the need for a separate input gate eta_t. We call this variant RKM-CIFG.

4 Extending the Filter Length

4.1 Generalized form of recurrent model

Consider a generalization of (1):

h_t = f(W^(x_0) x_t + W^(x_-1) x_{t-1} + ... + W^(x_-n+1) x_{t-n+1} + W^(h) h_{t-1} + b)    (12)

where W^(x_.) in R^{d x m}, W^(h) in R^{d x d}, and therefore the update of the hidden state h_t (see footnote 3) depends on data observed n >= 1 time steps prior, and also on the previous hidden state h_{t-1}.
Analogous to (3), we may express

e_i = f(W^(x_0) x~_{i,0} + W^(x_-1) x~_{i,-1} + ... + W^(x_-n+1) x~_{i,-n+1} + W^(h) h~_i + b)    (13)

The inner product f(W^(x_0) x_t + W^(x_-1) x_{t-1} + ... + W^(x_-n+1) x_{t-n+1} + W^(h) h_{t-1} + b)^T f(W^(x_0) x~_{i,0} + W^(x_-1) x~_{i,-1} + ... + W^(x_-n+1) x~_{i,-n+1} + W^(h) h~_i + b) is assumed represented by a Mercer kernel, and h'_{i,t} = e_i^T h_t.

Let X_t = (x_t, x_{t-1}, . . . , x_{t-n+1}) in R^{m x n} be an n-gram input with zero padding if t < (n - 1), and X~ = (X~_0, X~_{-1}, . . . , X~_{-n+1}) be n sets of filters, with the i-th rows of X~_0, X~_{-1}, . . . , X~_{-n+1} collectively representing the i-th n-gram filter, with i in {1, . . . , j}. Extending Section 2, the kernel is defined

h'_t = q_theta(c_t) ,   c_t = c~_t + q_theta(c_{t-1}) ,   c~_t = X~ . X_t    (14)

where X~ . X_t := X~_0 x_t + X~_{-1} x_{t-1} + ... + X~_{-n+1} x_{t-n+1} in R^j. Note that X~ . X_t corresponds to the t-th component output from the n-gram convolution of the filters X~ and the input sequence; therefore, similar to Section 2, we represent h'_t = q_theta(c_t) as h'_t = k_theta(X~ * x_{<=t}), emphasizing that the kernel evaluation is a function of outputs of the convolution X~ * x_{<=t}, here with n-gram filters. Like in the CNN [18, 37, 17], different filter lengths (and kernels) may be considered to constitute different components of the memory cell.

4.2 Linear kernel, CNN and Gated CNN

For the linear kernel discussed in connection to (9), equation (14) becomes

h'_t = c_t = sigma_i^2 (X~ . X_t) + sigma_f^2 h'_{t-1}    (15)

For the special case of sigma_f^2 = 0 and sigma_i^2 equal to a constant (e.g., sigma_i^2 = 1), (15) reduces to a convolutional neural network (CNN), with a nonlinear operation typically applied subsequently to h'_t. Rather than setting sigma_i^2 to a constant, one may impose dynamic gating, yielding the model (with sigma_f^2 = 0)

h'_t = eta_t . (X~ . X_t) ,   eta_t = sigma(X~_eta . X_t + b_eta)    (16)

where X~_eta are distinct convolutional filters for calculating eta_t, and b_eta is a vector of biases.

3 Note that while the same symbol is used as in (12), h_t clearly takes on a different meaning when n > 1.
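The n-gram convolution output X~ . X_t with zero padding for t < n - 1 can be sketched as follows; the filters and inputs here are illustrative assumptions.

```python
# Sketch of the n-gram convolution output X~ . X_t = sum_k X~_{-k} x_{t-k},
# with x_{t-k} treated as zero when t - k < 0 (zero padding).

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def ngram_conv(filters, xs, t):
    # filters = (X~_0, X~_{-1}, ..., X~_{-n+1}), each a j x m matrix
    n = len(filters)
    j = len(filters[0])
    out = [0.0] * j
    for k in range(n):
        if t - k < 0:
            continue  # zero padding at the start of the sequence
        Xk = filters[k]
        out = [o + dot(Xk[i], xs[t - k]) for i, o in enumerate(out)]
    return out

# toy sizes: m = 2, j = 2, n = 3
X0  = [[1.0, 0.0], [0.0, 1.0]]
Xm1 = [[0.5, 0.0], [0.0, 0.5]]
Xm2 = [[0.25, 0.0], [0.0, 0.25]]
xs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

print(ngram_conv((X0, Xm1, Xm2), xs, 0))  # only x_0 contributes
print(ngram_conv((X0, Xm1, Xm2), xs, 2))  # x_2 + 0.5*x_1 + 0.25*x_0 (with these toy filters)
```

Feeding these outputs into the recursion of (6) gives the n-gram memory cell of (14).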
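Because setting sigma_f^2 = 0 removes all recurrence, the gated model of (16) can be evaluated independently at every time step. The sketch below is illustrative (toy 1-gram filters and biases are assumptions, and a real Gated CNN would use n-gram filters and learned weights).

```python
import math

# Sketch of the Gated CNN special case (16): h'_t = eta_t . (X~ . X_t),
# eta_t = sigma(X~_eta . X_t + b_eta).  With no memory (sigma_f^2 = 0),
# each time step is computed independently, hence in parallel.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

X_tilde = [[0.6, -0.8], [0.3, 0.4]]  # content filters (j = 2, m = 2, n = 1)
X_eta   = [[1.0, 0.0], [0.0, 1.0]]   # separate gating filters X~_eta
b_eta   = [0.0, 0.0]
xs = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]

def gated_cnn_step(x_t):
    content = [dot(row, x_t) for row in X_tilde]  # X~ . X_t, no nonlinearity
    gate = [sigmoid(dot(row, x_t) + b) for row, b in zip(X_eta, b_eta)]  # eta_t
    return [g * c for g, c in zip(gate, content)]

h_all = [gated_cnn_step(x_t) for x_t in xs]  # no recurrence: steps are independent
print(h_all)
```

Note there is no nonlinearity on the content path, only the multiplicative gate, matching the discussion of (16).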
The form of the model in (16) corresponds to the Gated CNN [10], which we see as a special case of the recurrent model with linear kernel and dynamic kernel weights (and without feedback, i.e., sigma_f^2 = 0). Note that in (16) a nonlinear function is not imposed on the output of the convolution X~ . X_t; there is only dynamic gating via multiplication with eta_t, the advantages of which are discussed in [10]. Further, the n-gram input considered in (12) need not be consecutive. If spacings between inputs of more than 1 are considered, then the dilated convolution (e.g., as used in [31]) is recovered.

4.3 Feedback and the generalized LSTM

Now introducing feedback into the memory cell, the model in (8) is extended to

h'_t = q_theta(c_t) ,   c_t = c~_t + q_theta(c_{t-1}) ,   c~_t = X~ . X_t + H~ h'_{t-1}    (17)

Again motivated by the linear kernel, generalization of (17) to include gating networks is

h'_t = o_t . c_t ,   c_t = eta_t . c~_t + f_t . c_{t-1} ,   c~_t = X~ . X_t + H~ h'_{t-1}    (18)
o_t = sigma(X~_o . X_t + W~_o h'_{t-1} + b_o) ,   eta_t = sigma(X~_eta . X_t + W~_eta h'_{t-1} + b_eta) ,   f_t = sigma(X~_f . X_t + W~_f h'_{t-1} + b_f)    (19)

where y_t = A h'_t, and X~_o, X~_eta, and X~_f are separate sets of n-gram convolutional filters akin to X~. As an n-gram generalization of (10)-(11), we refer to (18)-(19) as an n-gram RKM-LSTM.

The model in (18) and (19) is similar to the LSTM, with important differences: (i) there is not a nonlinearity imposed on the update to the memory cell, c~_t, and therefore there are also no biases imposed on this cell update; (ii) there is no nonlinearity on the output; and (iii) via the convolutions with X~, X~_o, X~_eta, and X~_f, the memory cell can take into account n-grams, and the length of such sequences n_i may vary as a function of the element of the memory cell.

5 Related Work

In our development of the kernel perspective of the RNN, we have emphasized that the form of the kernel k_theta(z~_i, z_t) = q_theta(z~_i^T z_t) yields a recursive means of kernel evaluation that is only a function of the elements at the output of the convolutions x~_i * x_{<=t} or X~ * x_{<=t}, for 1-gram and (n > 1)-gram filters, respectively. This underscores that at the heart of such models, one performs convolutions between the sequence of data (. . . , x_{t+1}, x_t, x_{t-1}, . . . ) and the filters. Consideration of filters of length greater than one (in time) yields a generalization of the traditional LSTM. The dependence of such models entirely on convolutions of the data sequence and filters is evocative of CNN and Gated CNN models for text [18, 37, 17, 10], with this made explicit in Section 4.2 as a special case.

The Gated CNN in (16) and the generalized LSTM in (18)-(19) both employ dynamic gating. However, the generalized LSTM explicitly employs a memory cell (and feedback), and hence offers the potential to leverage long-term memory. While memory affords advantages, a noted limitation of the LSTM is that computation of h'_t is sequential, undermining parallel computation, particularly while training [10, 33]. In the Gated CNN, h'_t comes directly from the output of the gated convolution, allowing parallel fitting of the model to time-dependent data. While the Gated CNN does not employ recurrence, the filters of length n > 1 do leverage extended temporal dependence.
Further, via deep Gated CNNs [10], the effective support of the filters at deeper layers can be expansive.

Recurrent kernels of the form k_theta(z~, z_t) = q_theta(z~^T z_t) were also developed in [14], but with the goal of extending recurrent kernel machines to sequential inputs, rather than making connections with RNNs. The formulation in Section 2 has two important differences with that prior work. First, we employ the same vector x~_i for all shift positions t of the inner product x~_i^T x_t. By contrast, in [14] effectively infinite-dimensional filters are used, because the filter x~_{t,i} changes with t. This makes implementation computationally impractical, necessitating truncation of the long temporal filter. Additionally, the feedback of h'_t in (8) was not considered, and as discussed in Section 3.2, our proposed setup yields natural connections to long short-term memory (LSTM) [15, 13].

Model                | Input                   | Cell                                      | Output                   | Parameters
LSTM [15]            | z'_t = [x_t, h'_{t-1}]  | c_t = eta_t . tanh(c~_t) + f_t . c_{t-1}  | h'_t = o_t . tanh(c_t)   | (nm + d)(4d)
RKM-LSTM             | z'_t = [x_t, h'_{t-1}]  | c_t = eta_t . c~_t + f_t . c_{t-1}        | h'_t = o_t . c_t         | (nm + d)(4d)
RKM-CIFG             | z'_t = [x_t, h'_{t-1}]  | c_t = (1 - f_t) . c~_t + f_t . c_{t-1}    | h'_t = o_t . c_t         | (nm + d)(3d)
Linear Kernel w/ o_t | z'_t = [x_t, h'_{t-1}]  | c_t = sigma_i^2 c~_t + sigma_f^2 c_{t-1}  | h'_t = o_t . c_t         | (nm + d)(2d)
Linear Kernel        | z'_t = [x_t, h'_{t-1}]  | c_t = sigma_i^2 c~_t + sigma_f^2 c_{t-1}  | h'_t = tanh(c_t)         | (nm + d)(d)
Gated CNN [10]       | z'_t = x_t              | c_t = sigma_i^2 c~_t                      | h'_t = o_t . c_t         | (nm)(2d)
CNN [18]             | z'_t = x_t              | c_t = sigma_i^2 c~_t                      | h'_t = tanh(c_t)         | (nm)(d)

Table 1: Model variants under consideration, assuming 1-gram inputs. Concatenating additional inputs x_{t-1}, . . . , x_{t-n+1} to z'_t in the Input column yields the corresponding n-gram model. Number of model parameters is shown for input x_t in R^m and output h'_t in R^d.

Prior work analyzing neural networks from an RKHS perspective has largely been based on the feature mapping phi_theta(x) and the weight omega [1, 5, 23, 36]. For the recurrent model of interest here, function h_t = f(W^(x) x_t + W^(h) h_{t-1} + b) plays a role like phi_theta(x), as a mapping of an input x_t to what may be viewed as a feature vector h_t. However, because of the recurrence, h_t is a function of (x_t, x_{t-1}, . . . ) for an arbitrarily long time period prior to time t:

h_t(x_t, x_{t-1}, . . . ) = f(W^(x) x_t + b + W^(h) f(W^(x) x_{t-1} + b + W^(h) f(W^(x) x_{t-2} + b + . . . )))    (20)

However, rather than explicitly working with h_t(x_t, x_{t-1}, . . .
), we focus on the kernel k_theta(z~_i, z_t) = q_theta(z~_i^T z_t) = k_theta(x~_i * x_{<=t}).

The authors of [21] derive recurrent neural networks from a string kernel by replacing the exact matching function with an inner product, and assume the decay factor to be a nonlinear function. Convolutional neural networks are recovered by replacing a pointwise multiplication with addition. However, the formulation cannot recover the standard LSTM formulation, nor is there a consistent formulation for all the gates. The authors of [28] introduce a kernel-based update rule to approximate backpropagation through time (BPTT) for RNN training, but still follow the standard RNN structure.

Previous works have considered recurrent models with n-gram inputs as in (12). For example, strongly-typed RNNs [3] consider bigram inputs, but the previous input x_{t-1} is used as a replacement for h_{t-1} rather than in conjunction, as in our formulation. Quasi-RNNs [6] are similar to [3], but generalize them with a convolutional filter for the input and use different nonlinearities. Inputs corresponding to n-grams have also been implicitly considered by models that use convolutional layers to extract features from n-grams that are then fed into a recurrent network (e.g., [8, 35, 38]). Relative to (18), these models contain an extra nonlinearity f(.) from the convolution and projection matrix W^(x) from the recurrent cell, and no longer recover the CNN [18, 37, 17] or Gated CNN [10] as special cases.

6 Experiments

In the following experiments, we consider several model variants, with nomenclature as follows. The n-gram LSTM developed in Sec. 4.3 is a generalization of the standard LSTM [15] (for which n = 1). We denote RKM-LSTM (recurrent kernel machine LSTM) as corresponding to (10)-(11), which resembles the n-gram LSTM, but without a tanh nonlinearity on the cell update c~_t or emission c_t.
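A single RKM-LSTM step per (10)-(11) can be sketched directly; note the two departures from the standard LSTM, marked in comments. The toy weights (shared across gates only for brevity) and dimensions are illustrative assumptions.

```python
import math

# Sketch of one RKM-LSTM step, per (10)-(11).  Unlike the standard LSTM:
# (i) no tanh (and no bias) on the cell update c~_t, and
# (ii) no tanh on the emission: h'_t = o_t . c_t.

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def mat_vec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def rkm_lstm_step(x_t, h_prev, c_prev, W_c, W_o, W_eta, W_f, b_o, b_eta, b_f):
    z = x_t + h_prev                  # z'_t = [x_t, h'_{t-1}] (list concatenation)
    c_tilde = mat_vec(W_c, z)         # c~_t = W_c z'_t  -- no tanh, no bias
    o   = [sigmoid(v + b) for v, b in zip(mat_vec(W_o, z), b_o)]
    eta = [sigmoid(v + b) for v, b in zip(mat_vec(W_eta, z), b_eta)]
    f   = [sigmoid(v + b) for v, b in zip(mat_vec(W_f, z), b_f)]
    c = [e * ct + fi * cp for e, ct, fi, cp in zip(eta, c_tilde, f, c_prev)]
    h = [oi * ci for oi, ci in zip(o, c)]  # no tanh on the output
    return h, c

# toy sizes: m = 2 input, d = 2 hidden; one shared d x (m + d) weight matrix
W = [[0.1, -0.1, 0.2, 0.0], [0.0, 0.2, -0.1, 0.1]]
zeros = [0.0, 0.0]
h, c = zeros, zeros
for x_t in [[1.0, 0.0], [0.0, 1.0]]:
    h, c = rkm_lstm_step(x_t, h, c, W, W, W, W, zeros, zeros, zeros)
print(h, c)
```

Setting eta = [1 - fi for fi in f] inside the step would give the RKM-CIFG variant described next.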
We term RKM-CIFG an RKM-LSTM with ηt = 1 − ft, as discussed in Section 3.2. Linear Kernel w/ ot corresponds to (10)-(11) with ηt = σ²i and ft = σ²f, where σ²i and σ²f are time-invariant constants; this corresponds to a linear kernel for the update of the memory cell, with dynamic gating on the output via ot. We also consider the same model without dynamic gating on the output, i.e., ot = 1 for all t (with a tanh nonlinearity on the output), which we call Linear Kernel. The Gated CNN corresponds to the model in [10], which is the same as Linear Kernel w/ ot, but with σ²f = 0 (i.e., no memory). Finally, we consider a CNN model [18] that is the same as the Linear Kernel model, but without feedback or memory, i.e., z't = xt and σ²f = 0. For all of these, we may also consider an n-gram generalization, as introduced in Section 4. For example, a 3-gram RKM-LSTM corresponds to (18)-(19), with length-3 convolutional filters in the time dimension. The models are summarized in Table 1. All experiments are run on a single NVIDIA Titan X GPU.

Document Classification  We show results for several popular document classification datasets [37] in Table 2. The AGNews and Yahoo! datasets are topic classification tasks, while Yelp Full is sentiment analysis and DBpedia is ontology classification. The same basic network architecture

Table 2: Document classification accuracy for 1-gram and 3-gram versions of various models.
Total parameters of each model are shown, excluding word embeddings and the classifier.

Model | Params (1-gram / 3-gram) | AGNews (1g / 3g) | DBpedia (1g / 3g) | Yahoo! (1g / 3g) | Yelp Full (1g / 3g)
LSTM | 720K / 1.44M | 91.82 / 92.46 | 98.98 / 98.97 | 77.74 / 77.72 | 66.27 / 66.37
RKM-LSTM | 720K / 1.44M | 91.76 / 92.28 | 98.97 / 99.00 | 77.70 / 77.72 | 65.92 / 66.43
RKM-CIFG | 540K / 1.08M | 92.29 / 92.39 | 98.99 / 99.05 | 77.71 / 77.91 | 65.93 / 65.92
Linear Kernel w/ ot | 360K / 720K | 92.07 / 91.49 | 98.96 / 98.94 | 77.41 / 77.53 | 65.35 / 65.94
Linear Kernel | 180K / 360K | 91.62 / 91.50 | 98.65 / 98.77 | 76.93 / 76.53 | 61.18 / 62.11
Gated CNN [10] | 180K / 540K | 91.54 / 91.78 | 98.37 / 98.77 | 72.92 / 76.66 | 60.25 / 64.30
CNN [18] | 90K / 270K | 91.20 / 91.53 | 98.17 / 98.52 | 72.51 / 75.97 | 59.77 / 62.08

Model | PTB PPL valid | PTB PPL test | Wikitext-2 PPL valid | Wikitext-2 PPL test
LSTM [15, 25] | 61.2 | 58.9 | 68.74 | 65.68
RKM-LSTM | 60.3 | 58.2 | 67.85 | 65.22
RKM-CIFG | 61.9 | 59.5 | 69.12 | 66.03
Linear Kernel w/ ot | 72.3 | 69.7 | 84.23 | 80.21

Table 3: Language model perplexity (PPL) on validation and test sets of the Penn Treebank and Wikitext-2 language modeling tasks.

is used for all models, with the only difference being the choice of recurrent cell, which we make single-layer and unidirectional. Hidden representations h't are aggregated with mean pooling across time, followed by two fully connected layers, the second having output size corresponding to the number of classes of the dataset. We use 300-dimensional GloVe [27] as our word embedding initialization and set the dimensions of all hidden units to 300. We follow the same preprocessing procedure as in [34]. Layer normalization [2] is performed after the computation of the cell state ct. For the Linear Kernel w/ ot and the Linear Kernel, we set⁴ σ²i = σ²f = 0.5.

Notably, the derived RKM-LSTM model performs comparably to the standard LSTM across all considered datasets. We also find the CIFG version of the RKM-LSTM model to have similar accuracy.
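The reported parameter counts follow directly from the shapes of the per-cell linear maps with m = d = 300. The accounting below (number of linear maps per cell, with biases, embeddings, and the classifier excluded) is our own inference from the model definitions, shown here as a quick arithmetic check:

```python
# Consistency check of the per-cell parameter counts (m = d = 300).
m = d = 300

def cell_params(num_maps, recurrent, n):
    """num_maps linear maps into R^d; the cell input z'_t has size
    nm + d when the cell is recurrent ([x_t, h'_{t-1}]), else nm."""
    in_dim = n * m + (d if recurrent else 0)
    return num_maps * in_dim * d

models = {                              # (linear maps per cell, recurrent?)
    "LSTM":                (4, True),   # candidate + input/forget/output gates
    "RKM-LSTM":            (4, True),
    "RKM-CIFG":            (3, True),   # eta_t = 1 - f_t ties two gates
    "Linear Kernel w/ ot": (2, True),   # candidate + output gate
    "Linear Kernel":       (1, True),   # candidate only
    "Gated CNN":           (2, False),  # no recurrent feedback
    "CNN":                 (1, False),
}

for name, (k, rec) in models.items():
    print(f"{name}: 1-gram {cell_params(k, rec, 1):,}, 3-gram {cell_params(k, rec, 3):,}")
```

For example, the LSTM with n = 1 gives (300 + 300)(4 · 300) = 720K, and with n = 3 gives 1.44M, matching the counts above.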
As the recurrent model becomes less sophisticated with regard to gating and memory, we see a corresponding decrease in classification accuracy. This decrease is especially significant for Yelp Full, which requires a more intricate comprehension of the entire text to make a correct prediction. This is in contrast to AGNews and DBpedia, where the success of the 1-gram CNN indicates that simple keyword matching is sufficient to do well. We also observe that generalizing the model to consider n-gram inputs typically improves performance; the highest accuracies for each dataset were achieved by an n-gram model.

Language Modeling  We also perform experiments on the popular word-level language generation datasets Penn Treebank (PTB) [24] and Wikitext-2 [26], reporting validation and test perplexities (PPL) in Table 3. We adopt AWD-LSTM [25] as our base model⁵, replacing the standard LSTM with RKM-LSTM, RKM-CIFG, and Linear Kernel w/ ot for comparison. We keep all other hyperparameters at their defaults. Here we consider 1-gram filters, as they performed best for this task; given that the datasets considered here are smaller than those in the classification experiments, 1-grams are less likely to overfit. Note that static gating on the update of the memory cell (Linear Kernel w/ ot) does considerably worse than the models with dynamic input and forget gates on the memory cell. The RKM-LSTM model consistently outperforms the traditional LSTM, again showing that the models derived from recurrent kernel machines work well in practice for the data considered.

LFP Classification  We perform experiments on a Local Field Potential (LFP) dataset. The LFP signal is a multi-channel time series recorded inside the brain to measure neural activity.
The LFP dataset used in this work contains recordings from 29 mice (wild-type or CLOCKΔ19 [32]), while the mice were (i) in their home cages, (ii) in an open field, and (iii) suspended by their tails. There are a total of m = 11 channels, and the sampling rate is 1000 Hz. The goal of this task is to predict the state of a mouse from a 1-second segment of its LFP recording, as a 3-way classification problem. In order to test model generalizability, we perform leave-one-out cross-validation: data from each mouse is iteratively held out for testing while the remaining mice are used for training.

⁴σ²i and σ²f can also be learned, but we found this not to have much effect on the final performance.
⁵We use the official codebase https://github.com/salesforce/awd-lstm-lm and report experiment results before two-step fine-tuning.

Model | n-gram LSTM | RKM-LSTM | RKM-CIFG | Linear Kernel w/ ot | Linear Kernel | Gated CNN [10] | CNN [22]
Accuracy | 80.24 | 79.02 | 77.58 | 76.11 | 73.13 | 76.02 | 73.40

Table 4: Mean leave-one-out classification accuracies for mouse LFP data. For each model, (n = 40)-gram filters are considered, and the number of filters in each model is 30.

SyncNet [22] is a CNN model with specifically designed wavelet filters for neural data. We incorporate the SyncNet form of n-gram convolutional filters into our recurrent framework (we have parametric n-gram convolutional filters, with parameters learned). As was demonstrated in Section 4.2, the CNN is a memory-less special case of our derived generalized LSTM. An illustration of the modified
model (Figure 3) can be found in Appendix A, along with further details on SyncNet.

While the filters of SyncNet are interpretable and can prevent overfitting (because they have a small number of parameters), the same kind of generalization to an n-gram LSTM can be made without increasing the number of learned parameters. We do so for all of the recurrent cell types in Table 1, with the CNN corresponding to the original SyncNet model. Compared to the original SyncNet model, our newly proposed models can jointly consider the time dependency within the whole signal. The mean classification accuracies across all mice are compared in Table 4, where we observe substantial improvements in prediction accuracy from the addition of memory cells to the model. Thus, considering the time dependency in the neural signal appears to be beneficial for identifying hidden patterns. Classification performance per subject (Figure 4) can be found in Appendix A.

7 Conclusions

The principal contribution of this paper is a new perspective on gated RNNs, leveraging concepts from recurrent kernel machines. From that standpoint, we have derived a model closely connected to the LSTM [15, 13] (for convolutional filters of length one), and have extended such models to convolutional filters of length greater than one, yielding a generalization of the LSTM. The CNN [18, 37, 17], Gated CNN [10], and RAN [20] models are recovered as special cases of the developed framework. We have demonstrated the efficacy of the derived models on NLP and neuroscience tasks, for which our RKM variants show comparable or better performance than the LSTM. In particular,
we observe that extending LSTM variants with convolutional filters of length greater than one can significantly improve performance in LFP classification relative to recent prior work.

Acknowledgments
The research reported here was supported in part by DARPA, DOE, NIH, NSF and ONR.

References
[1] Fabio Anselmi, Lorenzo Rosasco, Cheston Tan, and Tomaso Poggio. Deep Convolutional Networks are Hierarchical Kernel Machines. arXiv:1508.01084, 2015.
[2] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer Normalization. arXiv:1607.06450, 2016.
[3] David Balduzzi and Muhammad Ghifary. Strongly-Typed Recurrent Neural Networks. International Conference on Machine Learning, 2016.
[4] Alain Berlinet and Christine Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Kluwer Publishers, 2004.
[5] Alberto Bietti and Julien Mairal. Invariance and Stability of Deep Convolutional Representations. Neural Information Processing Systems, 2017.
[6] James Bradbury, Stephen Merity, Caiming Xiong, and Richard Socher. Quasi-Recurrent Neural Networks. International Conference on Learning Representations, 2017.
[7] Mia Xu Chen, Orhan Firat, Ankur Bapna, Melvin Johnson, Wolfgang Macherey, George Foster, Llion Jones, Niki Parmar, Mike Schuster, Zhifeng Chen, Yonghui Wu, and Macduff Hughes. The Best of Both Worlds: Combining Recent Advances in Neural Machine Translation. arXiv:1804.09849v2, 2018.
[8] Jianpeng Cheng and Mirella Lapata. Neural Summarization by Extracting Sentences and Words. Association for Computational Linguistics, 2016.
[9] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. Empirical Methods in Natural Language Processing, 2014.
[10] Yann N.
Dauphin, Angela Fan, Michael Auli, and David Grangier. Language Modeling with Gated Convolutional Networks. International Conference on Machine Learning, 2017.
[11] Marc G. Genton. Classes of Kernels for Machine Learning: A Statistics Perspective. Journal of Machine Learning Research, 2001.
[12] David Golub and Xiaodong He. Character-Level Question Answering with Attention. Empirical Methods in Natural Language Processing, 2016.
[13] Klaus Greff, Rupesh Kumar Srivastava, Jan Koutník, Bas R. Steunebrink, and Jürgen Schmidhuber. LSTM: A Search Space Odyssey. Transactions on Neural Networks and Learning Systems, 2017.
[14] Michiel Hermans and Benjamin Schrauwen. Recurrent Kernel Machines: Computing with Infinite Echo State Networks. Neural Computation, 2012.
[15] Sepp Hochreiter and Jürgen Schmidhuber. Long Short-Term Memory. Neural Computation, 1997.
[16] Rafal Jozefowicz, Wojciech Zaremba, and Ilya Sutskever. An Empirical Exploration of Recurrent Network Architectures. International Conference on Machine Learning, 2015.
[17] Yoon Kim. Convolutional Neural Networks for Sentence Classification. Empirical Methods in Natural Language Processing, 2014.
[18] Yann LeCun and Yoshua Bengio. Convolutional Networks for Images, Speech, and Time Series. The Handbook of Brain Theory and Neural Networks, 1995.
[19] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based Learning Applied to Document Recognition. Proceedings of the IEEE, 1998.
[20] Kenton Lee, Omer Levy, and Luke Zettlemoyer. Recurrent Additive Networks. arXiv:1705.07393v2, 2017.
[21] Tao Lei, Wengong Jin, Regina Barzilay, and Tommi Jaakkola. Deriving Neural Architectures from Sequence and Graph Kernels. International Conference on Machine Learning, 2017.
[22] Yitong Li, Michael Murias, Samantha Major, Geraldine Dawson, Kafui Dzirasa, Lawrence Carin, and David E. Carlson.
Targeting EEG/LFP Synchrony with Neural Nets. Neural Information Processing Systems, 2017.
[23] Julien Mairal. End-to-End Kernel Learning with Supervised Convolutional Kernel Networks. Neural Information Processing Systems, 2016.
[24] Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. Building a Large Annotated Corpus of English: The Penn Treebank. Association for Computational Linguistics, 1993.
[25] Stephen Merity, Nitish Shirish Keskar, and Richard Socher. Regularizing and Optimizing LSTM Language Models. International Conference on Learning Representations, 2018.
[26] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer Sentinel Mixture Models. International Conference on Learning Representations, 2017.
[27] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global Vectors for Word Representation. Empirical Methods in Natural Language Processing, 2014.
[28] Christopher Roth, Ingmar Kanitscheider, and Ila Fiete. Kernel RNN Learning (KeRNL). International Conference on Learning Representations, 2019.
[29] Bernhard Scholkopf and Alexander J. Smola. Learning with Kernels. MIT Press, 2002.
[30] Corentin Tallec and Yann Ollivier. Can Recurrent Neural Networks Warp Time? International Conference on Learning Representations, 2018.
[31] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A Generative Model for Raw Audio. arXiv:1609.03499, 2016.
[32] Jordy van Enkhuizen, Arpi Minassian, and Jared W. Young. Further evidence for ClockΔ19 mice as a model for bipolar disorder mania using cross-species tests of exploration and sensorimotor gating. Behavioural Brain Research, 249:44–54, 2013.
[33] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin.
Attention Is All You Need. Neural Information Processing Systems, 2017.
[34] Guoyin Wang, Chunyuan Li, Wenlin Wang, Yizhe Zhang, Dinghan Shen, Xinyuan Zhang, Ricardo Henao, and Lawrence Carin. Joint Embedding of Words and Labels for Text Classification. Association for Computational Linguistics, 2018.
[35] Jin Wang, Liang-Chih Yu, K. Robert Lai, and Xuejie Zhang. Dimensional Sentiment Analysis Using a Regional CNN-LSTM Model. Association for Computational Linguistics, 2016.
[36] Andrew Gordon Wilson, Zhiting Hu, Ruslan Salakhutdinov, and Eric P. Xing. Deep Kernel Learning. International Conference on Artificial Intelligence and Statistics, 2016.
[37] Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level Convolutional Networks for Text Classification. Neural Information Processing Systems, 2015.
[38] Chunting Zhou, Chonglin Sun, Zhiyuan Liu, and Francis C.M. Lau. A C-LSTM Neural Network for Text Classification. arXiv:1511.08630, 2015.

A More Details of the LFP Experiment

In this section, we provide more details on the Sync-RKM model. In order to incorporate the SyncNet model [22] into our framework, the weight W(x) = [W(x0), W(x−1), · · · , W(x−n+1)] defined in Eq. (12) is parameterized as wavelet filters. If there is a total of K filters, then W(x) is of size K × C × n.

Specifically, suppose the n-gram input data at time t is given as Xt = [xt−n+1, · · · , xt] ∈ RC×n, with channel number C and window size n. The k-th filter for channel c can be written as

W(x)kc = αkc cos(ωkt + φkc) exp(−βkt²)    (21)

W(x)kc has the form of the Morlet wavelet base function. The parameters to be learned are αkc, ωk, φkc, and βk, for c = 1, · · · , C and k = 1, · · · , K.
t is a time grid of length n, which is a constant vector. In the recurrent cell, each W(x)kc is convolved with the c-th channel of Xt using 1-d convolution. Figure 3 gives the framework of this Sync-RKM model. For more details of how the filters work, please refer to the original work [22].

Figure 3: Illustration of the proposed model with SyncNet filters. The input LFP signal is given by the C × T matrix. The SyncNet filters (right) are applied on signal chunks at each time step.

When applying the Sync-RKM model to LFP data, we choose the window size n = 40 to account for the time dependencies in the signal. Since the experiment is performed by iteratively treating each mouse as the test subject, we show the subject-wise classification accuracy in Figure 4. The proposed models do consistently better across nearly all subjects.

Figure 4: Subject-wise classification accuracy comparison for the LFP dataset.