{"title": "Complex Gated Recurrent Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 10536, "page_last": 10546, "abstract": "Complex numbers have long been favoured for digital signal processing, yet\ncomplex representations rarely appear in deep learning architectures. RNNs, widely\nused to process time series and sequence information, could greatly benefit from\ncomplex representations. We present a novel complex gated recurrent cell, which\nis a hybrid cell combining complex-valued and norm-preserving state transitions\nwith a gating mechanism. The resulting RNN exhibits excellent stability and\nconvergence properties and performs competitively on the synthetic memory and\nadding task, as well as on the real-world tasks of human motion prediction.", "full_text": "Complex Gated Recurrent Neural Networks\n\nMoritz Wolter\n\nInstitute for Computer Science\n\nUniversity of Bonn\n\nwolter@cs.uni-bonn.de\n\nAngela Yao\n\nSchool of Computing\n\nNational University of Singapore\n\nyaoa@comp.nus.edu.sg\n\nAbstract\n\nComplex numbers have long been favoured for digital signal processing, yet\ncomplex representations rarely appear in deep learning architectures. RNNs, widely\nused to process time series and sequence information, could greatly bene\ufb01t from\ncomplex representations. We present a novel complex gated recurrent cell, which\nis a hybrid cell combining complex-valued and norm-preserving state transitions\nwith a gating mechanism. The resulting RNN exhibits excellent stability and\nconvergence properties and performs competitively on the synthetic memory and\nadding task, as well as on the real-world tasks of human motion prediction.\n\n1 Introduction\n\nRecurrent neural networks (RNNs) are widely used for processing time series and sequential infor-\nmation. 
The difficulties of training RNNs, especially when trying to learn long-term dependencies, are well-established, as RNNs are prone to vanishing and exploding gradients [2, 12, 31]. Heuristics developed to alleviate some of the optimization instabilities and learning difficulties include gradient clipping [9, 29], gating [4, 12], and using norm-preserving state transition matrices [1, 13, 16, 40]. Gating, as used in gated recurrent units (GRUs) [4] and long short-term memory (LSTM) networks [12], has become common-place in recurrent architectures. Gates facilitate the learning of longer-term temporal relationships [12]. Furthermore, in the presence of noise in the input signal, gates can protect the cell state from undesired updates, thereby improving overall stability and convergence.

A matrix W is norm-preserving if its repeated multiplication with a vector leaves the vector norm unchanged, i.e. ‖Wh‖₂ = ‖h‖₂. Norm-preserving state transition matrices are particularly interesting for RNNs because they preserve gradients over time [1], thereby preventing both the vanishing and the exploding gradient problem. To be norm-preserving, state transition matrices need to be either orthogonal or unitary¹.

Complex numbers have long been favored for signal processing [11, 24, 27]. A complex signal does not simply double the dimensionality of the signal. Instead, the representational richness of complex signals is rooted in their physical relevance and the mathematical theory of complex analysis. Complex arithmetic, and in particular multiplication, is different from its real counterpart and allows us to construct novel network architectures with several desirable properties. Despite networks being complex-valued, however, it is often necessary to work with real-valued cost functions and/or existing real-valued network components. Mappings from C → R are therefore indispensable.
Unfortunately, such functions violate the Cauchy-Riemann equations and are not complex-differentiable in the traditional sense. We advocate the use of Wirtinger calculus [39] (also known as CR-calculus [21]), which makes it possible to define complex (partial) derivatives, even when working with non-holomorph or non-analytic functions.

¹Unitary matrices are the complex analogue of orthogonal matrices, i.e. a complex matrix W is unitary if W W̄ᵀ = W̄ᵀ W = I, where W̄ᵀ is its conjugate transpose and I is the identity matrix.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Complex-valued representations have begun receiving some attention in the deep learning community, but they have been applied only to the most basic of architectures [1, 10, 36]. For recurrent networks, complex representations could gain more acceptance if they were shown to be compatible with more commonly used gated architectures and also competitive on real-world data. This is exactly the aim of this work, where we propose a complex-valued gated recurrent network and show how it can easily be implemented with a standard deep learning library such as TensorFlow. Our contributions can be summarized as follows²:

• We introduce a novel complex-gated recurrent unit; to the best of our knowledge, we are the first to explore such a structure using complex number representations.

• We compare experimentally the effects of a bounded versus unbounded non-linearity in recurrent networks, finding additional evidence countering the commonly held heuristic that only bounded non-linearities should be applied in RNNs.
In our case unbounded non-linearities perform better, but they must be coupled with the stabilizing measure of using norm-preserving state transition matrices.

• Our complex gated network is stable and fast to train; it outperforms the state of the art with equal parameters on synthetic tasks and delivers state-of-the-art performance on the real-world application of predicting poses in human motion capture while using fewer weights.

2 Related work

The current body of literature in deep learning focuses predominantly on real-valued neural networks. Theory for learning with complex-valued data, however, was established long before the breakthroughs of deep learning. This includes the development of complex non-linearities and activation functions [7, 18], the computation of complex gradients and Hessians [37], and complex backpropagation [3, 23].

Complex-valued representations were first used in deep networks to model phase dependencies for more biologically plausible neurons [33] and to augment the memory of LSTMs [5], whereby half of the cell state is interpreted as the imaginary component. In contrast, true complex-valued networks (including this work) have not only complex-valued states but also complex-valued kernels. Recently, complex CNNs have been proposed as an alternative for classifying natural images [10, 36] and for the inverse mapping of MRI signals [38]. Complex CNNs were found to be competitive with or better than the state of the art [36] and significantly less prone to over-fitting [10].

For temporal sequences, complex-valued RNNs have also been explored [1, 13, 17, 40], though there the interest in complex representations stems from improved learning stability. In [1], norm-preserving state transition matrices are used to prevent vanishing and exploding gradients. Since it is difficult to parameterize real-valued orthogonal weights, [1] recommends shifting to the complex domain, resulting in a unitary RNN (uRNN).
The weights of the uRNN in [1], for computational efficiency, are constructed as a product of component unitary matrices. As such, they span only a reduced subset of unitary matrices and do not have the expressiveness of the full set. Alternative methods of parameterizing the unitary matrices have been explored [13, 17, 40]. Our proposed complex gated RNN (cgRNN) builds on these works in that we also use unitary state transition matrices. In particular, we adopt the parameterization of [40], in which weights are parameterized by full-dimensional unitary matrices, though any of the other parameterizations [1, 13, 17] can also be substituted.

3 Preliminaries

We represent a complex number z ∈ C as z = x + iy, where x = Re(z) and y = Im(z) are the real and imaginary parts respectively. The complex conjugate of z is z̄ = x − iy. In polar coordinates, z can be expressed as z = |z|e^{iθ_z}, where |z| and θ_z are the magnitude and phase respectively and θ_z = atan2(y, x). Note that z₁ · z₂ = |z₁||z₂|e^{i(θ₁+θ₂)}, that z₁ + z₂ = x₁ + x₂ + i(y₁ + y₂), and that s · z = s|z|e^{iθ_z} for s ∈ R. The expression s · z scales z's magnitude, while leaving the phase intact.

²Source code available at https://github.com/v0lta/Complex-gated-recurrent-neural-networks

3.1 Complex Gradients

A complex-valued function f : C → C can be expressed as f(z) = u(x, y) + iv(x, y), where u(·,·) and v(·,·) are two real-valued functions. The complex derivative of f(z), or the C-derivative, is defined if and only if f is holomorph. In such a case, the partial derivatives of u and v must not only exist but also satisfy the Cauchy-Riemann equations, ∂u/∂x = ∂v/∂y and ∂v/∂x = −∂u/∂y. Strict holomorphy can be overly stringent for deep learning purposes.
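As a quick numerical illustration of the Cauchy-Riemann equations just stated, consider the holomorph function f(z) = z², so that u(x, y) = x² − y² and v(x, y) = 2xy; the function, test point, and finite-difference check below are our own illustrative choices, not part of the original experiments:

```python
def u(x, y):
    # real part of f(z) = z^2 = (x + iy)^2
    return x**2 - y**2

def v(x, y):
    # imaginary part of f(z) = z^2
    return 2.0 * x * y

def partial(f, x, y, wrt, eps=1e-6):
    """Central finite difference of f with respect to x or y."""
    if wrt == "x":
        return (f(x + eps, y) - f(x - eps, y)) / (2.0 * eps)
    return (f(x, y + eps) - f(x, y - eps)) / (2.0 * eps)

x0, y0 = 0.7, -1.3
# Cauchy-Riemann: du/dx = dv/dy and dv/dx = -du/dy
cr1 = abs(partial(u, x0, y0, "x") - partial(v, x0, y0, "y"))
cr2 = abs(partial(v, x0, y0, "x") + partial(u, x0, y0, "y"))
```

Both residuals vanish up to floating-point error for any holomorph f; for a non-holomorph mapping such as f(z) = z̄, at least one of them does not.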
In fact, Liouville's theorem [25] states that the only complex function which is both holomorph and bounded is a constant function. This implies that for complex (activation) functions, one must trade off either boundedness or differentiability. One can forgo holomorphy and still leverage the theoretical framework of Wirtinger or CR-calculus [21, 27] to work separately with the R- and R̄-derivatives³:

R-derivative ≜ ∂f/∂z |_{z̄=const} = (1/2)(∂f/∂x − i ∂f/∂y),   R̄-derivative ≜ ∂f/∂z̄ |_{z=const} = (1/2)(∂f/∂x + i ∂f/∂y).   (1)

Based on these derivatives, one can define the chain rule for a function g(f(z)) as follows:

∂g(f(z))/∂z = (∂g/∂f)(∂f/∂z) + (∂g/∂f̄)(∂f̄/∂z),   where   f̄ = u(x, y) − iv(x, y).   (2)

Since mappings from C → R can generally be expressed in terms of the complex variable z and its conjugate z̄, the Wirtinger calculus allows us to formulate and theoretically understand the gradient of real-valued loss functions in an easy yet principled way.

3.2 A Split Complex Approach

We work with a split-complex approach, where real-valued non-linear activations are applied separately to the real and imaginary parts of the complex number. This is convenient for implementation, since standard deep learning libraries are not designed to work natively with complex representations. Instead, we store complex numbers as two real-valued components. Split-complex activation functions process either the magnitude and phase, or the real and imaginary components, with two real-valued non-linear functions and then recombine the two into a new complex quantity. While some may argue this reduces the utility of having complex representations, we prefer this to fully complex activations.
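As an illustration of the R- and R̄-derivatives of Equation (1), take f(z) = |z|² = z z̄, which maps C → R and is not holomorph; Wirtinger calculus gives ∂f/∂z = z̄ and ∂f/∂z̄ = z. A minimal numeric sketch (the test point and step size are arbitrary choices of ours):

```python
def f(x, y):
    # f(z) = |z|^2 = x^2 + y^2 with z = x + iy, a C -> R mapping
    return x**2 + y**2

eps = 1e-6
x0, y0 = 1.5, -0.4
z0 = complex(x0, y0)

df_dx = (f(x0 + eps, y0) - f(x0 - eps, y0)) / (2.0 * eps)
df_dy = (f(x0, y0 + eps) - f(x0, y0 - eps)) / (2.0 * eps)

r_deriv = 0.5 * (df_dx - 1j * df_dy)     # R-derivative, expect conj(z0)
rbar_deriv = 0.5 * (df_dx + 1j * df_dy)  # R-bar-derivative, expect z0
```

The two derivatives come out to z̄₀ and z₀ respectively, matching the closed-form Wirtinger result even though no C-derivative exists.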
Fully complex non-linearities do exist and may seem favorable [36], since one then needs to keep track of only the R-derivatives, but due to Liouville's theorem, we must forgo boundedness and then deal with forward-pass instabilities.

4 Complex Gated RNNs

4.1 Basic Complex RNN Formulation

Without any assumptions on real versus complex representations, we define a basic RNN as follows:

z_t = W h_{t−1} + V x_t + b,   (3)
h_t = f_a(z_t),   (4)

where x_t and h_t represent the input and hidden unit vectors at time t, f_a is a point-wise non-linear activation function, and W and V are the hidden and input state transition matrices respectively. In working with complex networks, x_t ∈ C^{n_x×1}, h_t ∈ C^{n_h×1}, W ∈ C^{n_h×n_h}, V ∈ C^{n_h×n_x} and b ∈ C^{n_h×1}, where n_x and n_h are the dimensionalities of the input and hidden states respectively.

4.2 Complex Non-linear Activation Functions

Choosing a non-linear activation function f_a for complex networks can be non-trivial. Though holomorph non-linearities using transcendental functions have also been explored in the literature [27], the presence of singularities makes them difficult to learn in a stable manner. Instead, bounded non-holomorph non-linearities tend to be favoured [11, 27], where bounded real-valued non-linearities are applied on the real and imaginary part separately.
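The recurrence of Equations (3) and (4) is straightforward to sketch in complex NumPy arithmetic; the sizes, the scale of the random weights, and the split-tanh activation below are illustrative choices only:

```python
import numpy as np

rng = np.random.default_rng(0)
nx, nh = 4, 8  # illustrative input / hidden sizes

def crandn(*shape):
    """Small random complex matrix with independent real and imaginary parts."""
    return (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) * 0.1

W, V = crandn(nh, nh), crandn(nh, nx)
b = crandn(nh, 1)

def split_tanh(z):
    """Split-complex activation: a real tanh on the real and imaginary parts."""
    return np.tanh(z.real) + 1j * np.tanh(z.imag)

def rnn_step(h_prev, x_t):
    z_t = W @ h_prev + V @ x_t + b   # Equation (3)
    return split_tanh(z_t)           # Equation (4)

h = np.zeros((nh, 1), dtype=complex)
for _ in range(5):
    h = rnn_step(h, crandn(nx, 1))
```

Because the tanh is applied componentwise, both the real and the imaginary part of the state stay within [−1, 1].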
This also parallels the convention of using (bounded) tanh non-linearities in real RNNs.

³For holomorph functions the R̄-derivative is zero and the C-derivative is equal to the R-derivative.

Figure 1: Surface plots of the magnitude of the Hirose (m² = 1) and modReLU (b = −0.5) activations.

A common split is with respect to the magnitude and phase. This non-linearity was popularized by Hirose [11] and scales the magnitude by a factor m² before passing it through a tanh:

f_Hirose(z) = tanh(|z|/m²) e^{iθ_z} = tanh(|z|/m²) · z/|z|.   (5)

In other areas of deep learning, the rectified linear unit (ReLU) is now the go-to non-linearity. In comparison to sigmoid or tanh activations, ReLUs are computationally cheap, expedite convergence [22] and also perform better [30, 26, 42]. However, there is no direct extension into the complex domain, and as such, modified versions have been proposed [10, 38]. The most popular is the modReLU [1] – a variation of the Hirose non-linearity, where the tanh is replaced with a ReLU and b is an offset:

f_modReLU(z) = ReLU(|z| + b) e^{iθ_z} = ReLU(|z| + b) · z/|z|.   (6)

4.3 R → C input and C → R output mappings

While several time series problems are inherently complex, especially when considering their Fourier representations, the majority of benchmark problems in machine learning are still only defined in the real number domain.
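The two non-linearities of Equations (5) and (6) can be written compactly; a NumPy sketch (the small-magnitude guard is our own addition, to avoid dividing by zero at z = 0):

```python
import numpy as np

def hirose(z, m=1.0):
    """Hirose activation, Equation (5): squash the magnitude, keep the phase."""
    mag = np.abs(z)
    return np.tanh(mag / m**2) * z / np.maximum(mag, 1e-12)

def mod_relu(z, b=-0.5):
    """modReLU, Equation (6): offset the magnitude by b, rectify, keep the phase."""
    mag = np.abs(z)
    return np.maximum(mag + b, 0.0) * z / np.maximum(mag, 1e-12)

z = np.array([0.3 + 0.1j, 2.0 - 1.0j])
h, m = hirose(z), mod_relu(z)
```

With b = −0.5, inputs whose magnitude is below 0.5 are zeroed out entirely, while larger inputs keep their phase and lose 0.5 in magnitude; the Hirose activation keeps every output magnitude strictly below 1.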
However, one can still solve these problems with complex representations, since a real z simply has a zero imaginary component, i.e. Im(z) = 0 and z = x + i · 0. To map the complex state h into a real output o_r, we use a linear combination of the real and imaginary components, similar to [1], with W_o and b_o as weights and offset:

o_r = W_o [Re(h) Im(h)] + b_o.   (7)

4.4 Optimization on the Stiefel Manifold for Norm Preservation

In [1], it was proven that a unitary⁴ W would prevent vanishing and exploding gradients of the cost function C with respect to h_t, since the gradient magnitude is bounded. However, this proof hinges on the assumption that the derivative of f_a is also unity. This assumption is valid if the pre-activations are real and one chooses the ReLU as the non-linearity. For complex pre-activations, however, this is no longer a valid assumption. Neither the Hirose non-linearity (Equation 5) nor the modReLU (Equation 6) can guarantee stability (despite the suggestion otherwise in the original proof [1]). Even though it is not possible to guarantee stability, we strongly advocate using norm-preserving state transition matrices, since they do still have excellent stabilizing effects. This was demonstrated experimentally in [1, 13, 40] and we find similar evidence in our own experiments (see Figure 2).

Ensuring that W remains unitary during the optimization can be challenging, especially since the group of unitary matrices is not closed under addition. As such, it is not possible to learn W with standard update-based gradient descent.

⁴Since R ⊆ C, we use the term unitary to refer to both real orthogonal and complex unitary matrices, and make a distinction for clarity purposes only where necessary.
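The closure problem just described is easy to see numerically: subtracting even a small multiple of an arbitrary matrix from an exactly unitary W leaves the unitary group. The matrices and step size below are arbitrary stand-ins of ours, not actual gradients:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6

# Exactly unitary starting point: the Q factor of a complex QR decomposition.
M = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
W, _ = np.linalg.qr(M)

def unitarity_error(X):
    """Deviation of X from unitarity, ||X^H X - I||."""
    return np.linalg.norm(X.conj().T @ X - np.eye(n))

# A plain additive update with a stand-in "gradient" G breaks unitarity.
G = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
W_naive = W - 0.1 * G
```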
Alternatively, one can learn W on the Stiefel manifold [40], with the update from step k to k + 1 given, following [34], as below, where λ is the learning rate, I the identity matrix, and F the cost function:

W_{k+1} = (I + (λ/2) A_k)^{−1} (I − (λ/2) A_k) W_k,   where   A_k = ∇_W F W̄ₖᵀ − W_k ∇_W F̄ᵀ.   (8)

4.5 Complex-Valued Gating Units

In keeping with the spirit that gates determine the amount of a signal to pass, we construct a complex gate as a C^{n_h×1} → R^{n_h×1} mapping. As in real gated RNNs, the gate is applied as an element-wise product, i.e. g ⊙ h = g ⊙ |h|e^{iθ_h}. In our complex case, this type of operation results in an element-wise scaling of the hidden state's magnitude. When the gate is 0, it completely resets a signal, whereas when it is 1, it passes the signal on entirely. We introduce our gates into the RNN in a similar fashion as the classic GRU [4]:

z̃_t = W(g_r ⊙ h_{t−1}) + V x_t + b,   (9)
h_t = g_z ⊙ f_a(z̃_t) + (1 − g_z) ⊙ h_{t−1},   (10)

where g_r and g_z represent the reset and update gates respectively and are defined with corresponding subscripts r and z as

g_r = f_g(z_r),   where   z_r = W_r h_{t−1} + V_r x_t + b_r,   (11)
g_z = f_g(z_z),   where   z_z = W_z h_{t−1} + V_z x_t + b_z.   (12)

Above, f_g denotes the gate activation, W_r ∈ C^{n_h×n_h} and W_z ∈ C^{n_h×n_h} denote state-to-state transition matrices, V_r ∈ C^{n_h×n_i} and V_z ∈ C^{n_h×n_i} the input-to-state transition matrices, and b_r ∈ C^{n_h} and b_z ∈ C^{n_h} the biases. f_g is a non-linear gate activation function defined as:

f_modSigmoid(z) = σ(α Re(z) + β Im(z)),   α, β ∈ [0, 1].   (13)

We call this the modSigmoid and justify the choice experimentally in Section 5.3. As mentioned previously, even with unitary state transition matrices, this type of gating is not mathematically guaranteed to be stable.
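The Cayley-style update of Equation (8) can be checked numerically: because A is skew-Hermitian, the update keeps W exactly unitary regardless of the gradient. A sketch, where the random unitary start and the stand-in gradient are our own choices (in training, ∇_W F would come from backpropagation):

```python
import numpy as np

rng = np.random.default_rng(2)
n, lam = 6, 0.01
I = np.eye(n)

M = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
W, _ = np.linalg.qr(M)  # exactly unitary starting point

# Stand-in for the cost gradient with respect to W.
G = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))

# Skew-Hermitian A built from the gradient, then the update of Equation (8).
A = G @ W.conj().T - W @ G.conj().T
W_next = np.linalg.inv(I + (lam / 2) * A) @ (I - (lam / 2) * A) @ W

skew_err = np.linalg.norm(A + A.conj().T)             # A^H = -A
unit_err = np.linalg.norm(W_next.conj().T @ W_next - I)
```

The Cayley transform of a skew-Hermitian matrix is unitary, and a product of unitary matrices is unitary, so the residual on W_next stays at floating-point level for any G.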
However, the effects of vanishing gradients are mitigated by the fact that the derivatives are distributed over a sum [12, 4]. Exploding gradients are clipped.

5 Experimentation

5.1 Tasks & Evaluation Metrics

We test our cgRNN on two benchmark synthetic tasks: the memory problem and the adding problem [12]. These problems are designed especially to challenge RNNs, and require the networks to store information over time scales on the order of hundreds of time steps. The first is the memory problem, where the RNN should remember n input symbols over a time period of length T + 2n, based on a dictionary set {s₁, s₂, ..., s_n, s_b, s_d}, where s₁ to s_n are symbols to memorize and s_b and s_d are blank and delimiter symbols respectively. The input sequence, of length T + 2n, is composed of n symbols drawn randomly with replacement from {s₁, ..., s_n}, followed by T − 1 repetitions of s_b, then s_d, and another n repetitions of s_b. The objective of the RNN, after being presented with the initial n symbols, is to generate an output sequence of length T + 2n consisting of repetitions of s_b and, upon seeing s_d, the original n input symbols. A network without memory would output s_b and, once presented with s_d, randomly predict any of the original n symbols; this results in a categorical cross entropy of n log(8)/(T + 2n). For our experiments, we choose n = 8 and T = 250.

In the adding problem, two sequences of length T are given as input, where the first sequence consists of numbers randomly sampled from U[0, 1]⁵, while the second is an indicator sequence of all 0s and exactly two 1s, with the first 1 placed randomly in the first half of the sequence and the second 1 placed randomly in the second half. The objective of the RNN is to predict the sum of the two marked entries of the first input sequence once the second 1 is presented in the indicator input sequence.
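The adding-problem data described above can be generated in a few lines; a sketch (the function name and array layout are our own choices):

```python
import numpy as np

rng = np.random.default_rng(3)

def adding_sample(T=250):
    """One adding-problem sample: a U[0,1] value sequence plus a two-hot indicator."""
    values = rng.uniform(0.0, 1.0, size=T)
    indicator = np.zeros(T)
    i = rng.integers(0, T // 2)       # first marker, placed in the first half
    j = rng.integers(T // 2, T)       # second marker, placed in the second half
    indicator[[i, j]] = 1.0
    return np.stack([values, indicator]), values[i] + values[j]

# The memoryless baseline predicts 1 (the mean of the sum) at every step; its
# MSE is the variance of the sum of two independent U[0,1] draws, 1/6 ~= 0.167.
targets = np.array([adding_sample()[1] for _ in range(20000)])
baseline_mse = float(np.mean((targets - 1.0) ** 2))
```

The empirical baseline MSE matches the 0.167 figure quoted in the text.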
A naive baseline would predict 1 at every time step, regardless of the input indicator sequence's value; this produces a mean squared error (MSE) of 0.167, i.e. the variance of the sum of two independent uniform distributions. For our experiments, we choose T = 250.

⁵Note that this is a variant of [12]'s original adding problem, which draws numbers from U[−1, 1] and uses three indicators {−1, 0, 1}. Our variant is consistent with the state of the art [1, 13, 40].

We apply the cgRNN to the real-world task of human motion prediction, i.e. predicting future 3D poses of a person given the past motion sequence. This task is of interest to diverse areas of research, including 3D tracking in computer vision [41], motion synthesis for graphics [20], as well as pose and action prediction for assistive robotics [19]. We follow the same experimental setting as [28], working with the full Human 3.6M dataset [14]. For training, we use six of the seven actors and test on actor five. We use the pre-processed data of [15], which converts the motion capture into exponential-map representations of each joint. Based on an input sequence of body poses from 50 frames, the future 10 frames are predicted; this is equivalent to predicting 400 ms. The error is measured by the Euclidean distance in Euler angles with respect to the ground-truth poses.

We also test the cgRNN on native complex data drawn from the frequency domain by applying it to the real-world task of music transcription. Given a music waveform file, the network should determine the notes played by each instrument. We use the MusicNet dataset [35], which consists of 330 classical music recordings, of which 327 are used for training and 3 are held out for testing. Each recording, sampled at 11 kHz, is divided into segments of 2048 samples with a step size of 512 samples.
The transcription problem is defined as a multi-label classification problem, where for each segment a label vector y ∈ {0, 1}¹²⁸ describing the active keys in the corresponding MIDI file has to be found. We use the windowed Fourier transform of each segment as network input; the real and imaginary parts of the Fourier transform, i.e. the transforms of the even and odd signal components respectively, are used directly as inputs into the cgRNN.

5.2 RNN Implementation Details

We work in TensorFlow, using RMSProp to update standard weights and the multiplicative Stiefel-manifold update described in Equation 8 for all unitary state transition matrices. The unitary state transition matrices are initialized as in [1], as the product of component unitary matrices. All other weights are initialized using the uniform initialisation method recommended in [8], i.e. U[−l, l] with l = √(6/(n_in + n_out)), where n_in and n_out are the input and output dimensions of the tensor to be initialised. All biases are initialized as zero, with the exception of the gate biases b_r and b_z, which are initialized at 4 to ensure fully open gates and linear behaviour at the start of training. All synthetic tasks are run for 2·10⁴ iterations with a batch size of 50 and a constant learning rate of 0.001 for both the RMSProp and the Stiefel-manifold updates.

For the human motion prediction task, we adopt the state-of-the-art implementation of [28], which introduces residual velocity connections into the standard GRU. Our setup shares these modifications; we simply replace their core GRU cell with our cgRNN cell. The learning rate and batch size are kept the same (0.005, 16), though we reduce our state size to 512 to be compatible with [28]'s 1024⁶. For music transcription, we work with a bidirectional cgRNN encoder followed by a simple cgRNN decoder.
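The uniform initialisation from Section 5.2 is a one-liner; a sketch (the function name and the example shape, an input-to-state matrix, are illustrative):

```python
import math
import numpy as np

rng = np.random.default_rng(4)

def uniform_init(n_in, n_out):
    """U[-l, l] with l = sqrt(6 / (n_in + n_out)), as recommended in [8]."""
    l = math.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-l, l, size=(n_in, n_out))

V = uniform_init(80, 512)
```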
All cells are set with n_h = 1024; the learning rate is set to 0.0001 and the batch size to 5.

5.3 Impact of Gating and Choice of Gating Functions

We first analyse the impact that gating has on the synthetic tasks by comparing our cgRNN with the gateless uRNN from [1]. Both networks use complex representations and also unitary state transition matrices. As an additional baseline, we also compare with TensorFlow's out-of-the-box GRU. We choose the hidden state size n_h of each network to ensure that the resulting number of parameters is approximately equivalent (around 44k). We find that our cgRNN successfully solves both the memory problem as well as the adding problem. On the memory problem (see Figure 2(a), Table 1), gating does not play a role. Instead, having norm-preserving weight matrices is key to ensuring stability during learning. The GRU, which does not have norm-preserving state matrices, is highly unstable and fails to solve the problem. Our cgRNN achieves very similar performance to the uRNN. This has to do with the fact that we initialize our gate bias terms so that the gates are fully open, i.e. g_r = 1, g_z = 1. Under this setting, the formulation is the same as the uRNN, and the unitary W dominates the cell's dynamics.

⁶This reduction is larger than necessary – parameter-wise, the equivalent state size is √(1024²/2) ≈ 724.

Figure 2: Comparison of our cgRNN (blue, n_h = 80) with the uRNN [1] (orange, n_h = 140) and standard GRU [4] (green, n_h = 112) on the memory (a) and adding (b) problem for T = 250.
The hidden state size n_h for each network is chosen so as to approximately match the number of parameters (approximately 44k parameters in total). On the memory problem, having norm-preserving state transition matrices is critical for stable learning, while on the adding problem, having gates is important. Figure best viewed in colour.

For the adding problem, previous works [1, 13, 40] have suggested that gates are beneficial, and we confirm this result in Figure 2(b) and Table 1. We speculate that the advantage comes from the gates shielding the network from the irrelevant inputs of the adding problem, hence the success of our cgRNN as well as the GRU, but not the uRNN. Surprisingly, the standard GRU baseline, without any norm-preserving state transition matrices, works very well on the adding problem; in fact, it marginally outperforms our cgRNN. However, we believe this result does not speak to the inferiority of complex representations; instead, it is likely that the adding problem, as a synthetic task, is not able to leverage the advantages offered by the representation.

The gating function (Equation 13) was selected experimentally based on a systematic comparison of various functions. The performance of different gate functions is compared statistically in Table 1, where we look at the fraction of converged experiments over 20 runs as well as the mean number of iterations required until convergence. The product as well as the tied and free weighted-sum variations of the gating function are designed to resemble the bilinear gating mechanism used in [6]. From our experiments, we find that it is important to scale the real and imaginary components before passing them through the sigmoid to leverage the saturation constraint, and that the real and imaginary components should be combined linearly.
The exact weighting seems not to be important, and the best performing variants are the tied 2 and the free; to preserve generality, we advocate the use of the free variant. We note that over 20 runs, our cgRNN converged only on 15-16 runs; adding the gates introduces instabilities, however, we find the ability to solve the adding problem a reasonable trade-off.

Finally, we compare the cgRNN to a free real variant (see last row of Table 1), which is the most similar architecture in R, i.e. normalized hidden transition matrices, the same gate formulation, and two independent real-valued versions of Equations 11 and 12. This real variant has similar performance on the adding problem (for which having gates is critical), but cannot solve the memory problem. This is likely due to the set of real orthogonal matrices being too restrictive, making the problem more difficult in the real domain than in the complex.

Table 1: Comparison of gating functions on the adding and memory problems.

                                                           memory problem         adding problem
                 gating function                           frac.conv.  avg.iters. frac.conv.  avg.iters.
uRNN [40]        no gate                                   1.0         2235       0.0         -
cgRNN product    σ(Re(z))σ(Im(z))                          0.10        4625       1.0         4245
cgRNN tied 1     ασ(Re(z)) + (1−α)σ(Im(z))                 0.55        4186       1.0         5458
cgRNN tied 2     σ(αRe(z) + (1−α)Im(z))                    0.80        3800       1.0         5070
cgRNN free       σ(αRe(z) + βIm(z))                        0.75        2850       1.0         5235
cgRNN free real  σ(αz₁ + βz₂), (z₁, z₂) ∈ R                0.0         -          1.0         5313

The different gates are evaluated over 20 runs by looking at the fraction of convergent runs (frac.conv.) and the average number of iterations required for convergence (avg.iters.) if convergent. A run is considered convergent if the loss falls below 5·10⁻⁷ for the memory problem and 0.01 for the adding problem. We find that gating has no impact on the memory problem, i.e. the gateless uRNN [40] always converges, but is necessary for the adding problem. All experiments use weight-normalized recurrent weights, a cell size of n_h = 80, and networks with approximately 44k parameters; to keep approximately the same number of parameters, we set n_h = 140 for the uRNN and use two independent gates each with n_h = 90 for the free real case.

Figure 3: Comparison of non-linearities and norm-preserving state transition matrices on the cgRNNs for the memory (a) and adding (b) problems for T = 250. The unbounded modReLU (see Equation 6) performs best for both problems, but only if the state transition matrices are kept unitary. Without unitary state-transition matrices, the bounded Hirose non-linearity (see Equation 5) performs better. We use n_h = 80 for all experiments.

5.4 Non-Linearity Choice and Norm Preservation

We compare the bounded Hirose tanh non-linearity with the unbounded modReLU (see Section 4.2) in our cgRNN in Figure 3 and discover a strong interaction effect with norm preservation. First, we find that optimizing on the Stiefel manifold to preserve the norms of the state transition matrices significantly improves learning, regardless of the non-linearity.
In both the memory and the adding problem, keeping the state transition matrices unitary ensures faster and smoother convergence of the learning curve. Without unitary state transition matrices, the bounded tanh non-linearity, i.e. the conventional choice, is better than the unbounded modReLU. However, with unitary state transition matrices, the modReLU pulls ahead. We speculate that the modReLU, like the ReLU in the real setting, is the better choice of non-linearity. The advantages afforded to it by being unbounded, however, also make it more sensitive to instability, which is why these advantages are present only when the state transition matrices are kept unitary. Similar effects were observed in real RNNs in [32], in which batch normalization was required in order to learn a standard RNN with the ReLU non-linearity.

5.5 Real World Tasks: Human Motion Prediction & Music Transcription

We compare our cgRNN to the state-of-the-art GRU proposed by [28] on the task of human motion prediction, showing the results in Table 2. Our cgRNN delivers state-of-the-art performance while reducing the number of network parameters by almost 50%. However, this reduction comes at the cost of having to compute the matrix inverse in Equation 8. On the music transcription task, we are able to transcribe the input signals with an accuracy of 53%. While this falls short of the 72.9% achieved by the complex convolutional state of the art [36], their complex convolution-based network is fundamentally different from our approach.
We conclude that our cgRNN is able to extract meaningful information from complex-valued input data; integrating complex convolutions into our RNN is left as future work.

Table 2: Comparison of our cgRNN with the GRU [28] on human motion prediction.

                              cgRNN                       GRU [28]
Action              80ms  160ms  320ms  400ms    80ms  160ms  320ms  400ms
walking             0.29  0.48   0.74   0.84     0.27  0.47   0.67   0.73
eating              0.23  0.38   0.66   0.82     0.23  0.39   0.62   0.77
smoking             0.31  0.58   1.01   1.10     0.32  0.60   1.02   1.13
discussion          0.33  0.72   1.02   1.08     0.31  0.70   1.05   1.12
directions          0.41  0.65   0.83   0.93     0.41  0.65   0.83   0.96
greeting            0.53  0.87   1.26   1.43     0.52  0.86   1.30   1.47
phoning             0.58  1.09   1.57   1.72     0.59  1.07   1.50   1.67
posing              0.37  0.72   1.38   1.65     0.64  1.16   1.82   2.10
purchases           0.61  0.86   1.21   1.31     0.60  0.82   1.13   1.21
sitting             0.46  0.75   1.22   1.44     0.44  0.73   1.21   1.45
sitting down        0.55  1.02   1.54   1.73     0.48  0.89   1.36   1.57
taking photo        0.29  0.59   0.92   1.07     0.29  0.59   0.95   1.10
waiting             0.35  0.68   1.16   1.36     0.33  0.65   1.14   1.37
walking dog         0.57  1.09   1.45   1.55     0.54  0.94   1.32   1.49
walking together    0.27  0.53   0.77   0.86     0.28  0.56   0.80   0.88
average             0.41  0.73   1.12   1.26     0.42  0.74   1.12   1.27

Our cgRNN (nh = 512, 1.8M params) predicts human motions that are comparable to or slightly better than those of the real-valued GRU [28] (nh = 1024, 3.4M params), despite having only approximately half the parameters.

6 Conclusion

In this paper, we have proposed a novel complex gated recurrent unit, which we use together with unitary state transition matrices to form a stable and fast-to-train recurrent neural network. To enforce unitarity, we optimize the state transition matrices on the Stiefel manifold, which we show works well with the modReLU. Our complex gated RNN achieves state-of-the-art performance on the adding problem while remaining competitive on the memory problem.
We further demonstrate the applicability of our network on real-world tasks. In particular, for human motion prediction we achieve state-of-the-art performance while significantly reducing the number of weights. The experimental success of the cgRNN leads us to believe that complex representations have significant potential, and we advocate their use not only in recurrent networks but in deep learning as a whole.
Acknowledgements: Research was supported by the DFG project YA 447/2-1 (DFG Research Unit FOR 2535 Anticipating Human Behavior). We also gratefully acknowledge NVIDIA's donation of a Titan X Pascal GPU.

References

[1] M. Arjovsky, A. Shah, and Y. Bengio. Unitary evolution recurrent neural networks. In ICML, 2016.

[2] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Trans. on Neural Networks, 5(2):157-166, 1994.

[3] N. Benvenuto and F. Piazza. On the complex backpropagation algorithm. IEEE Trans. Signal Processing, 40(4):967-969, 1992.

[4] K. Cho, B. van Merriënboer, Ç. Gülçehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, 2014.

[5] I. Danihelka, G. Wayne, B. Uria, N. Kalchbrenner, and A. Graves. Associative long short-term memory. In ICML, 2016.

[6] J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y.N. Dauphin. Convolutional sequence to sequence learning. In ICML, 2017.

[7] G. Georgiou and C. Koutsougeras. Complex domain backpropagation. IEEE Trans. Circuits and Systems II: Analog and Digital Signal Processing, 39(5):330-334, 1992.

[8] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 2010.

[9] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.

[10] N. 
Guberman. On complex valued convolutional neural networks. Technical report, The Hebrew University of Jerusalem, Israel, 2016.

[11] A. Hirose. Complex-Valued Neural Networks: Advances and Applications. John Wiley & Sons, 2013.

[12] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 1997.

[13] S. Hyland and G. Rätsch. Learning unitary operators with help from u(n). In AAAI, 2017.

[14] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu. Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Trans. Pattern Analysis and Machine Intelligence, 36(7):1325-1339, 2014.

[15] A. Jain, A.R. Zamir, S. Savarese, and A. Saxena. Structural-RNN: Deep learning on spatio-temporal graphs. In CVPR, 2016.

[16] L. Jing, C. Gulcehre, J. Peurifoy, Y. Shen, M. Tegmark, M. Soljačić, and Y. Bengio. Gated orthogonal recurrent units: On learning to forget. In AAAI Workshops, 2018.

[17] L. Jing, Y. Shen, T. Dubček, J. Peurifoy, S. Skirlo, Y. LeCun, M. Tegmark, and M. Soljačić. Tunable efficient unitary neural networks (EUNN) and their application to RNNs. In ICML, 2017.

[18] T. Kim and T. Adali. Complex backpropagation neural network using elementary transcendental activation functions. In ICASSP, 2001.

[19] H. Koppula and A. Saxena. Anticipating human activities using object affordances for reactive robotic response. IEEE Trans. Pattern Analysis and Machine Intelligence, 38(1):14-29, 2016.

[20] L. Kovar, M. Gleicher, and F. Pighin. Motion graphs. ACM Trans. Graphics (TOG), 21(3):473-482, 2002.

[21] K. Kreutz-Delgado. The complex gradient operator and the CR-calculus. arXiv preprint:0906.4835, 2009.

[22] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.

[23] H. Leung and S. Haykin. The complex backpropagation algorithm. IEEE Trans. 
Signal Processing, 39(9):2101-2104, 1991.

[24] H. Li and T. Adali. Complex-valued adaptive signal processing using nonlinear functions. EURASIP Journal on Advances in Signal Processing, 2008.

[25] J. Liouville. Leçons sur les fonctions doublement périodiques. Journal für die reine und angewandte Mathematik, 1879.

[26] A. Maas, A. Hannun, and A. Ng. Rectifier nonlinearities improve neural network acoustic models. In ICML, 2013.

[27] D.P. Mandic and V.S.L. Goh. Complex Valued Nonlinear Adaptive Filters: Noncircularity, Widely Linear and Neural Models. John Wiley & Sons, 2009.

[28] J. Martinez, M.J. Black, and J. Romero. On human motion prediction using recurrent neural networks. In CVPR, 2017.

[29] T. Mikolov. Statistical language models based on neural networks. Technical report, Brno University of Technology, 2012.

[30] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, 2010.

[31] R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. In ICML, 2013.

[32] M. Ravanelli, P. Brakel, M. Omologo, and Y. Bengio. Improving speech recognition by revising gated recurrent units. In INTERSPEECH, 2017.

[33] D.P. Reichert and T. Serre. Neuronal synchrony in complex-valued deep networks. In ICLR, 2014.

[34] H. Tagare. Notes on optimization on Stiefel manifolds. Technical report, Yale University, 2011.

[35] J. Thickstun, Z. Harchaoui, and S.M. Kakade. Learning features of music from scratch. In ICLR, 2017.

[36] C. Trabelsi, O. Bilaniuk, Y. Zhang, D. Serdyuk, S. Subramanian, J. F. Santos, S. Mehri, N. Rostamzadeh, Y. Bengio, and C.J. Pal. Deep complex networks. In ICLR, 2018.

[37] A. van den Bos. Complex gradient and Hessian. IEE Proceedings - Vision, Image and Signal Processing, 141(6):380-382, 1994.

[38] P. Virtue, S.X. Yu, and M. Lustig. 
Better than real: Complex-valued neural nets for MRI fingerprinting. In ICIP, 2017.

[39] W. Wirtinger. Zur formalen Theorie der Funktionen von mehr komplexen Veränderlichen. Mathematische Annalen, 1927.

[40] S. Wisdom, T. Powers, J.R. Hershey, J. Le Roux, and L. Atlas. Full-capacity unitary recurrent neural networks. In NIPS, 2016.

[41] A. Yao, J. Gall, and L. Van Gool. Coupled action recognition and pose estimation from multiple views. International Journal of Computer Vision, 100(1):16-37, 2012.

[42] M. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q. V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G. E. Hinton. On rectified linear units for speech processing. In ICASSP, 2013.